
Handouts - Financial Econometrics

Handouts for the course
"Financial Econometrics and Empirical Finance Module I"
Francesco Corielli
August 26, 2020
It is perhaps not to be wondered at, since fortune is ever changing her
course and time is infinite, that the same incidents should occur many
times, spontaneously. For, if the multitude of elements is unlimited, fortune
has in the abundance of her material an ample provider of coincidences;
and if, on the other hand, there is a limited number of elements from which
events are interwoven, the same things must happen many times, being
brought to pass by the same agencies.
Plutarch, Parallel Lives, Life of Sertorius.
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
Ronald Fisher
It is true that M. Fourier held the opinion that the principal aim of mathematics was public utility and the explanation of natural phenomena; but a philosopher such as he ought to have known that the sole aim of science is the honour of the human spirit, and that, under this title, a question about numbers is worth as much as a question about the system of the world.
Carl Gustav Jacobi (letter to Adrien-Marie Legendre, from Königsberg, July 2nd, 1830)
Among the first examples of least squares: Roger Cotes with Robert Smith, ed.,
“Harmonia mensurarum”, (Cambridge, England: 1722), chapter: “Aestimatio errorum
in mixta mathesi per variationes partium trianguli plani et sphaerici", pag. 22.
A hypothesis should be made explicit: no systematic bias in the measuring instruments. Thomas Simpson points this out in: "An attempt to shew the advantage arising by taking the mean of a number of observations in practical astronomy" (from: "Miscellaneous Tracts on Some Curious Subjects ...", London, 1757). In modern terms this shall become E(ε|X) = 0 (see section 9).
Introduction
This course aims to offer students a selection of probabilistic and statistical applications compiled according to a twofold criterion: they should require the introduction of only a few new statistical tools and make use as much as possible of what the student should already know; and they should be, as far as possible given the introductory level of the course, "real world" tools, that is: they should represent simplified versions of tools really, and sometimes heavily, applied in the markets.
The course also aims to achieve a much more difficult task: trying to show how
probabilistic and statistical thinking can actually be useful for understanding and surviving markets.
Historical experience tells us that the first aim is achieved for the vast majority of students, and this is enough for getting a good grade at the end of the course.
About the more ambitious aim: it shall be for each student in this course to judge how much the study of this topic was useful for the student's professional life.
This is a Master course and it presumes some knowledge of Probability and Statistics from BA courses.
Being a course for a Master in Finance at Bocconi University, the kind of previous knowledge which is taken for granted is based on the related BA programs at Bocconi.
These programs are similar to the programs for BAs in Economics and Finance in most Universities around the world.
In any case, for students coming from other Universities, the syllabus and the preliminary course program should give a clear idea of which notions shall be taken for granted in this course.
The teachers of this course are fully available to help any student with suggestions
for readings which could be useful to complete preliminary knowledge.
Experience tells us that, not infrequently, students do not remember at once everything they indeed did study (and maybe got a good grade for) during their BA.
However, that memory of past knowledge easily comes back when necessary.
For an even better understanding of what is taken for granted (Statistics and Probability wise), a summary of the relevant definitions and theory is added as an appendix to these handouts (appendix 19). This is not intended as a standalone text in basic Statistics and Probability: all students should read it in order to check their level of knowledge in basic Statistics and ask the teachers for help in case of problems.
Basic notions of matrix algebra are also required and are summarized in sections 6
and 7 and in appendix 18.
The main theoretical tools introduced in this course which may be new for most
students are:
• some non parametric Statistics useful for value at risk computations
• an introductory but rather complete treatment of the multivariate linear model
• principal components analysis
Using these tools, and more basic notions of Probability and Statistics, the course
describes applications in the fields of:
• value at risk
• factorial risk models for asset allocation
• style analysis
• efficient portfolio analysis
• performance evaluation
• Black and Litterman asset allocation
Most of the examples and applications described in the course shall make use of stock
market data. This choice is not due to an assumed relevance of this market (it constitutes a rather small fraction of the overall financial market) but to the fact that dealing
with more relevant markets (interest rates, exchange rates, commodities etc.) requires institutional and technical knowledge that cannot be assumed to be at the disposal of the course students. In any case, with the proper changes, most of what is discussed in this course can be and is applied in almost all kinds of financial markets.
A note on this version of the handouts
This is a further revised version of the course handouts.
This is work in progress and, hopefully, it shall remain so.
Old errors and typos are corrected but new ones may still creep in.
I’ll be grateful for any suggestion, correction, comment and I am grateful for the
many past suggestions and corrections by students and colleagues.
These handouts receive an update each year in September at the beginning of the
course.
In the September 2019 update the only changes of some import are: a rewriting of the example in section 9.12.11 and of the first two subsections of chapter 11, together with some relabeling and renumbering of chapter 11 subsections.
The September 2020 update only corrects some errors and reformats some formulas.
This text was and is used for several different courses. Hence, a number of sections
and subsections of these handouts are not required for the 20191 course exam. This is
detailed in the course syllabus and specified in the text when necessary.
In particular, chapter 20, which was part of the course in academic year 2014/2015, and chapter 21 are no longer in the program of the 20191 course.
Probability, Statistics and Finance
Modern Finance studies are broadly divided into two (interconnected) fields: asset pricing and Corporate Finance.
Asset pricing studies the observed statistics on financial securities prices ("price" is here broadly intended as any kind of data concerning the value of an asset: it could be a price or a rate, a yield or an exchange rate etc.).
The aim of the study is twofold.
From the empirical point of view, we wish to characterize and summarize the joint
distribution of observed prices and its evolution. This is of paramount practical importance in the fields of asset management and risk management. Due to their quality, detail and observational frequency, financial data do require (and allow) more advanced statistical techniques than those required for dealing with standard economic data. On the other hand, for the same reason, the practical usefulness of the results obtained from these data is in most cases greater by several orders of magnitude, compared with what we can do with standard economics problems.
An empirical trace of this is in the fact that Finance is the only applied field
of Economics where rules, laws, conduct codes etc are mostly written in terms of
mathematical and statistical models.
From the theoretical point of view, the aim is to connect the observed financial
data with the overall evolution of the economic system. In this, Finance has most in
common with Economics.
Finance, however, can in many cases exploit stronger connections between observables (that is: prices) than standard Economics.
Most financial assets of interest are traded in fairly liquid and ordered markets.
For this reason traders must, as a rule, at least approximately avoid/exploit simple
arbitrage trades.
The avoidance of arbitrage, if implemented by a relevant part of market agents (and this may not be the case), imposes strong functional constraints on the price system.
Sometimes it is even possible to find exact or almost exact functional relationships among security prices ("arbitrage pricing") which are largely independent of a detailed modelling of agents' behaviours.
Only when stronger results are required, for instance results connecting financial prices with the general behaviour of an economy and, most difficult of all, when we look for the financial implications of economic policy acts, must the study of Finance resort to the usual Economics practice of deriving price systems from hypotheses on agents' behaviour.
This is a very difficult effort. A fortunate fact is that, for most practical market
applications, this, at least at a first level of approximation, can be avoided.
Corporate Finance, in its origins, mostly deals with the capital structure of a firm as
connected with its investments. The original problem was that of characterizing those
investments which could be valuable for the firm and how to finance these1. Over time, further problems were considered by the field, mainly problems concerned with
organizational and managerial issues. How to choose and compensate management,
how to implement strategies which are viable for the stakeholders, when and why to
reorganize the firm, and so on. Due to these developments, modern Corporate Finance
has much in common with organization theory and, in particular with the field of
industrial organization.
Most research in the field of Corporate Finance has a strong empirical twist, being concerned with the consequences of financing, organizational and investment decisions on the "value" of the firm. As a consequence, huge databases containing corporate data have been built in the last 20 years and the statistical applications to corporate studies have grown to an amazing size.
The Reader should notice that by "applications" here we do not only mean "academic applications": any Corporate Finance act (issue of new stock, bond issues, IPOs, M&A operations, investment planning etc.) is strongly based on a quantitative analysis of the relevant data.
For these reasons there is not much need for justifying the presence of (several)
courses in applied Probability and Statistics within a graduate curriculum in Finance.
A very direct proof could be a visit to any trading floor, participation in meetings for an M&A operation, spending some time in the risk management office of a bank or simply a reading of laws and regulations concerning the management of financial companies.
The recent emphasis on "big data", when not debased to a fad, further confirms the central role of data analysis and data modelling in the general financial field.
Another possible way to understand the role of quantitative methods in Finance (and in financial regulation) could be browsing through the programs of the institutional exams required in order to deal with clients in international markets.
Among these, just consider the FINRA Registered Representative levels in the USA.
But the simplest and most direct way, at least in our opinion, is to point out the fact that most of Finance has to do with deciding today the prices of economic entities whose precise future value is unknown. In such a field, the availability of a language for speaking about uncertain events is, obviously, a necessity.
1 It could be interesting to notice that, until recently, what exactly a "firm" is, why it should exist and in which respects a "firm" in economic theory is similar to what is called a "firm" in common language, has not been really clear. See on this point the introduction of Jean Tirole, (1988), "The theory of industrial organization", MIT Press.
Up to the present time, the most successful language devised for such a purpose is Probability Theory (competitors exist but lag far behind both in popularity and practical effectiveness).
It is interesting to notice that the language of Probability Theory, intimately connected with the statement of prices for betting on uncertain events, is more directly appropriate for dealing with uncertainty in the field of Finance, where bets are actually made, than, say, in Physics, where the problem is not (at least at a prima facie level) that of betting on uncertain results but, maybe, that of describing the long term frequencies (a very non-empirical concept) of experimental results2.
Probability and Statistics
Financial contracts bearing strong similarities to modern contracts were traded well before Hammurabi's code was carved (in fact an article of the code deals with a particular option contract).
Similar contracts were the basis for shipping ventures in classical Athens and later Rome.
Such contracts were quite common and were priced without problems by ancient financiers. Well, not exactly without problems, as many examples of such contracts came to us in the writings of famous lawyers/orators like Demosthenes.
The use of Probability in Finance, however, is much more recent and can be seen
as a direct offspring of the classical origin of Probability in the context of gambling.
However, gambling problems are usually much easier to deal with than, say, security
pricing problems. The reason for this is that in most "games of chance" two elements
are usually agreed upon.
First, the nature of the game is such that the probabilities of its results are agreed
upon by the vast majority of participants, usually via symmetry arguments (equal
probability of each side of a coin, or of drawing any given card, etc.).
Second, in typical situations the betting decisions of players do not change the probability of results (we are speaking about games of chance the like of roulette, trente-et-quarante, rouge et noir, dice games and the like. In most card games the element of chance is in the card dealing and is then mediated by a strategic element in the card play phase. This makes analysis much more complex).
The consequence of the first point is that, in typical games of chance, Statistics, as a tool for choosing probabilities, is not required (while it could be required in other betting settings, e.g. horse racing, football matches etc.).
2 See the appendix on pag. 202 for a summary of a definition of probability based on betting systems and its connection with frequencies.
Probability theory has nothing to say about the "right" probabilities to assign to possible events. Its field is the consistency (no arbitrage) among probability statements, whose numerical values do not originate in Probability Theory itself (except for obvious cases like the probability of the sure or impossible event), and, to a lesser degree, the interpretation of such statements.
As mentioned, the basic inputs required by Probability theory, namely the probabilities of simple events, are agreed upon in most games of chance, where almost everybody agrees on the validity of simple symmetry arguments from which numerical values of probabilities are derived3. Maybe these symmetry arguments are justified by some putative set of past observations; however, the agreement is so widespread that it could be possible, even if perhaps mistakenly, to tag as "wrong" probability assessments which disagree with the majority's. In this sense an inference tool for deriving probability estimates from, say, past data is not directly required for gambling4.
In words we are used to when considering financial risk management, we could
say that in chance games there is no model and estimation risk: numerical values of
probabilities for possible events can well be considered as given and “correct”, at least
in the sense that almost everybody agrees on these.
This is not true in the Finance milieu. In the case of financial prices, simple symmetry arguments are, as a rule, not applicable and, for this reason, we are often in need of estimating such values using past observations and models. In other words, we need Statistics, with the implied estimation risk. Moreover the statistical models we use are not "common knowledge". They are not derived by simple, agreed upon symmetry arguments like those used in most games of chance. This "uncertainty" about models is called "model risk".
Let us pass to the second point: the independence of probabilities from the strategic behaviour of players.
The fact that, say, the future price of an asset is directly dependent on the bets made on it by traders (not necessarily in any ordered or "rational" way) is a mainstay
of Finance (and Economics). It embodies the complex interaction between judgment
of values and expectations of prices which, through the concept of equilibrium, is both
the stuff of economic theory and of day by day work in the markets.
This interaction by itself contributes in determining the probabilities of market
events, probabilities which cannot at any time be considered as "given" as those in
typical games of chance.
This interaction is not only complex, but also subject to change, and usually does not satisfy symmetry arguments, so that it cannot, with the exception of very simple contexts, be ignored (as we ignore, e.g., in the rolling of a die, the complex but stable and symmetry-justifiable physical model of its chaotic rebounds on a hard surface) and solved with a simple symmetry-induced probability model (for the die: each side has probability 1/6 of turning up).
3 A great probabilist, Laplace, believed that this way of computing probability by symmetry was, or should be, possible in any sensible application of the concept of probability, excluding in this way, for instance, the application of Probability to horse racing betting.
4 It must be said that some Statistics is actually used for the periodical checking of the "randomness" of the chance generating engines used in gambling (the like of roulette, fortune's wheel or dice).
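The agreement between the symmetry assignment and long-run frequencies can be illustrated with a small simulation (a sketch we add here purely for illustration; the fair die is, of course, an idealization):

```python
import random

random.seed(0)

# For a fair die, the symmetry argument assigns probability 1/6 to each face.
# Long-run relative frequencies of simulated rolls agree with that assignment.
n_rolls = 600_000
counts = [0] * 6
for _ in range(n_rolls):
    counts[random.randrange(6)] += 1

freqs = [c / n_rolls for c in counts]
print([round(f, 3) for f in freqs])  # every entry close to 1/6 ~ 0.167
```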
Consider the hypothetical case of a die where each face has a weight depending, in an unknown way, on the amount of money gamblers play on it, where gamblers tend to choose their bets on the basis of numerological considerations depending on their humour and, maybe, on observed past results in the rolling of the same die.
Arguably, different analysts shall choose different models for this, and the different
choices shall have consequences on the behaviour of market agents. This complexity is
another source of “model risk” typical of the study of Finance.
Models, moreover, shall contain unknown quantities: "parameters", and we shall need to estimate these. Any estimate, based on rules of thumb or on best Statistics, shall imply a possible error; this is called "estimation risk" and the main difference between rules of thumb and good Statistics is the ability of the latter to quantify such risk.
In the end, this is what makes the field of Finance so different from (and much more difficult than) the study of games of chance. We do not only run the risks implied by the fact that results of bets are uncertain; we are also uncertain about how to model such risk (model risk) and we need data in order to estimate parameters in the models we decide to use (estimation risk).
In principle it is possible to separate the two aspects that make financial markets and
gambling casinos different. It is possible, and, in fact, it has been done in the past (for
instance by Yahoo Finance) to create fictional markets where "stocks" with absolutely
no economic meaning are "traded" between agents and future prices of these stocks are
determined (typically through an auction system) by the amounts "invested" in them
by players. It is interesting to notice that in such contexts, where the "true value" of
each share is in fact known to be 0 and the aim of the game is only that of moving ahead
of the flock, prices follow paths which are very similar, qualitatively, to those observed
in real financial markets. This should be instructive for understanding how, even when the traded securities have a real (if numerically not known) economic meaning and value, the simple interaction of agents in the market can create an evolution of prices partially independent of such value5.
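A toy version of such a fictional market can be sketched in a few lines (every behavioural rule below is our own crude assumption, meant only to show that herding alone generates wandering price paths for a security whose true value is zero):

```python
import random

random.seed(42)

# Toy "fictional market": the share's true economic value is 0, and each
# period's price is simply the total amount the players choose to invest
# (an auction-like rule). Players are assumed, purely for illustration,
# to chase the last price move: they stake more after a rise, less after
# a fall, plus an idiosyncratic whim.
n_players = 100
prices = [100.0, 101.0]  # two arbitrary starting prices
for _ in range(200):
    momentum = prices[-1] - prices[-2]
    total = 0.0
    for _ in range(n_players):
        base = prices[-1] / n_players      # status-quo stake
        tilt = 0.5 * momentum / n_players  # herd component
        whim = random.gauss(0.0, 0.05)     # idiosyncratic noise
        total += max(0.0, base + tilt + whim)
    prices.append(total)

# The path drifts and swings although nothing "fundamental" is there.
print(round(min(prices), 2), round(max(prices), 2))
```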
5 Economists hope that such "partial independence" is not too strong. In fact, financial markets have the relevant role of allocating investments among different economic endeavours in some "efficient" way, where "efficient" should mean, roughly, that investments with better prospects should receive, at least on average, more money. The question is whether markets are a setting in which this happens or whether the market induced "noise" can overwhelm any value "signal". The history of market crises contains a rich set of clues about an answer to this question and we can say that, at least in some cases, the answer may be in the negative (but we are at a loss if we are asked for some system different from the market and able to be "right" at least as frequently). Notice that even in casino games, where the "value" of each game (the probability of each possible result times the payoff of each result) is known, frequencies of bets fluctuate, together with players' whims, and it may be possible that, on some gambling tables and for some numbers or colors, we observe a concentration of bets which is totally unjustified by any anomalous probability of the coveted result but that may, all the same, last even for a considerable time. (The Reader should think about the huge literature on "late numbers" in bingo, lotto and similar games.) As we wrote above, financial markets are casinos with the added twist that probabilities of outcomes are not known and that outcomes themselves depend on the opinions and hopes of players. The resulting mess is, then, fully understandable; we do not like this but, alas, a better (or, at least, not worse) tool for allocating investments among uncertain prospects is still to be discovered! To conclude this footnote on a positive tone, we must say that the study of modern financial markets has an advantage with respect to the study of modern economic systems. While irrational, arbitrage ridden behaviour is possible in both settings, at least in "normal" times modern financial markets tend to punish arbitrage-allowing investors in such a quick and (financially) harmful way that a propensity for a coherent assessment of one's own personal bets is a strong point (we repeat: in normal times) of most big investors. In more general economic situations, such punishment is not so quick: it can typically be made to burden the decision maker's offspring or other people, hence it does not bind decisions too much. In other words the financial setting is a privileged setting in Economics at least because we can assume that most of the time agent behaviour may be stupid but not irrational.
Ultimately, the decision of how much to bet on a given future scenario requires both an assessment of economic value AND an evaluation of the consequences of the interacting opinions of agents. This is a very difficult task which, as we said above, cannot rely on simple symmetry arguments of the like used to "agree on probabilities" in standard games of chance.
Tools are required for economic evaluation and tools are required for connecting past observations of market and, more generally, economic events to the statement of probabilities useful for choosing among actions whose results depend on future events. In the financial field this makes Probability intimately connected with Statistics and, more generally, with Economics.
A Caveat. From what we wrote here it could be deduced that the business of Probability and Statistics in Finance is that of forecasting future prices. If by this is meant forecasting the exact value of a future price, this would be a wrong deduction.
Instead, if by the term forecasting we intend the assessment of probability distributions for future prices, we get a clearer picture of what we intend to do. Standard introductions to Finance theory stress this point by describing in a very simplified way an investment decision as a choice among risk/expected return pairs. More advanced analyses describe investment as a choice among probability distributions of future returns.
Antidotes to delusions
There is another relevant reason for the study of Probability and Statistics during a financially oriented training. Financial markets are "full of intrinsic randomness" in the sense that, since we do not possess, and in all probability shall never possess,
tools which allow us to forecast the future with precision, we must learn to live in an
environment of unresolvable uncertainty.
The human mind does not seem to adapt well to environments of this kind. Each time the future value of some variable is relevant for us but we are not able to determine it, either by forecasting or by direct intervention, our brain, which craves stable patterns, shall, if uncontrolled, tend to create such patterns out of nothing and be fooled into believing in illusions (there exists an immense literature on gambling behaviour and perception fallacies which substantiates this statement).
This explains at least a subset of observed irrational behaviours of investors.
Statistics and Probability are also relevant because they can be seen as antidotes to
such delusions6. They may not make us right most of the time and it may well be that some lucky dumbo shall enjoy results better than ours. But, at least, they may help us in not being upset if something which a priori we considered unlikely does indeed happen, or in restraining us when we may wish to change our decision rule due to events which in fact confirm the optimality of such a rule. Maybe this is not much, but in the not too long run it may count for much.
With the help of such tools we may understand, for instance, how, given the amount of variance in the market and the huge number of dumbo investors, the fact that some investor of this kind shall be better off than us "technically learned investors" is so likely as to be almost sure and, for this reason, this fact should not upset us or induce us into dumbo behaviours.
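The point can be made concrete with a short simulation (all the numbers below, means and volatilities, are our own arbitrary choices for illustration): with enough "dumbo" investors, the best of them beats a skilled investor in essentially every simulated year.

```python
import random

random.seed(1)

# Illustrative assumptions (ours): a "skilled" investor earns a 5% mean
# yearly return with 10% volatility; each "dumbo" earns 0% mean with 20%
# volatility. With many dumbos, the best of them almost surely beats the
# skilled investor in any single year.
n_trials = 200
n_dumbos = 5_000
beaten = 0
for _ in range(n_trials):
    skilled = random.gauss(0.05, 0.10)
    best_dumbo = max(random.gauss(0.0, 0.20) for _ in range(n_dumbos))
    if best_dumbo > skilled:
        beaten += 1

print(beaten / n_trials)  # very close to 1
```

Of course this says nothing about the average dumbo, who does much worse than the skilled investor; it only shows why the existence of a lucky winner is no evidence against skill.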
There exists a classic literature on the topics of gambling, luck and delusions. It goes back at the very least to Plutarch, but traces can be seen in the Bible itself. It is not peculiar to this literature that its principles are time after time "rediscovered" by well meaning, while perhaps not sufficiently well read, Authors. What is peculiar to it is that such rediscovery always surprises the readers as if it presented them with new ideas. This tells us much about our inability to learn how to deal with uncertainty. No matter how many times delusions of luck are explained, they are bound to come back again.
I suggest interested students read a couple of quite engaging classics in the field: "Chance and Luck" by the astronomer Richard Proctor (1887) and "Extraordinary Popular Delusions and the Madness of Crowds" by Charles Mackay (1841), both available for free on www.gutenberg.org.
Quantitative methods as legal disclaimer
In recent years, first in the US and then all over the world, a new role of quantitative models has surfaced and is becoming pervasive: that of legal disclaimer.
6 Only partial antidotes: at some time in the future everybody shall fall for the lure of "randomness deciphering". Anecdotes where the best statisticians "fell for it" pepper introductory and expository books on Statistics and Probability.
Suppose you are an asset manager and your client is not satisfied with your results. The client is going to question you about what you did in order to get such results. Maybe such questioning could take the form of a legal procedure where you are sued for malpractice.
This is common to most professions, just consider the medical profession as a classic
instance.
Answering such actions requires some definition of "right" practice which, in most professional fields, is very difficult to state (in principle, if such a definition were really possible, it would take the form of a computer program which could take the place of the involved professionals). In most fields, think again of the medical field, this has taken the shape of "protocols of intervention" which should define the lines of action a professional should follow when dealing with a case.
This was undoubtedly a useful development in some fields, but it was also accompanied by a huge increase in the bureaucratic aspect of any profession and, what is potentially more dangerous, it did imply an increase in the rigidity of behaviour when dealing with situations where following the protocol is potentially dangerous for the client/patient. In these cases the "protocol following" behaviour is useful for the professional, who could defend himself on this basis, but could be fatal for the client.
The classic solution to this problem, stated by Shakespeare in one of his most
quoted verses: Henry VI, Part 2, Act IV, Scene 2. "The first thing we do, let’s kill all
the lawyers", may seem a little drastic.
In Finance, in particular in the asset management field, the establishment of best practices and protocols takes the form, in most circumstances, of quantitative models of asset selection and evaluation which, in order to get an easy legal validity, are mostly based on standard academic arguments.
It is much easier to defend yourself in a court of justice by saying: "the asset allocation model we follow is *** and is based on published research by ***. The model gave a small probability of negative results which, alas, did actually happen", than by saying: "I chose the asset allocation on the basis of my gut feeling and past experience. It could go well, it could go bad. Tough luck: it did go bad". While there may not be much difference between the asset allocations, the first defense is certainly stronger than the second.
This attitude is so widespread that, in some fields of Finance, it has become law: just think about the risk management rules derived from the sequence of Basel agreements.
Just like in the field of medicine, such developments had some positive consequences
(beyond the disclaimer effect) in that they require discipline and clarity of analysis on
the part of agents.
They also contributed to spreading a "formal" use of quantitative models which, when not continuously questioned and updated, could increase the risk of decisions taken "because the model says so" just in view of the legal disclaimer a model may offer.
Whatever your evaluation of such a development may be, this is the world of finance you are going to enter.
If you want to be clever agents in this world you need to understand it, and understanding passes (also!) through the understanding of the quantitative tools applied for quantifying and for taking risks.
Required Probability and Statistics concepts. Sections
1-5.
In the first 5 sections of these handouts only very simple concepts from Statistics and Probability are required. The most relevant are as follows:
expected value, variance, standard deviation, correlation, statistical independence; moments and quantiles; the Binomial distribution; the Gaussian distribution; sampling variability; the Chebyshev inequality; confidence intervals; hypothesis testing.
These should be already known to the Readers from their BA courses; a very short summary is available in section 19 of these notes and is a required part of this course.
For a quick check of the basic points see:
19.21, 19.22, 19.23, 19.41, 19.42, 19.48, 19.28, 19.29, 19.32, 19.34, 19.35, from 19.76
to 19.81, 19.114, 19.115, 19.116, from 19.118 to 19.136, 19.24.
Beyond definitions and basic properties, the most important point to have clear
in mind is the differences and the connections between Probability and Statistics
concepts.
For a quick check go to section 19.
1 Returns

1.1 Return definitions
There is a love story with returns in Finance: while prices are the financially relevant
quantities (what we pay and what we get), we often speak and write models using
returns.
It is true that, for one period models, there is substantially no difference in considering a change in price and a return (a difference vs a percentage difference) as the
initial price is assumed known.
However, returns, while useful, can be tricky in multi period models or when using
time series data for estimating parameters in single period models. So, they must be
well understood.
Returns come in two versions. Let $P_{it}$ be the price of the $i$-th stock at time $t$.
The linear or simple return (in the future we shall deal with dividends and with
total returns) between times $t_{j-1}$ and $t_j$ is defined as:

$$r_{it_j} = P_{it_j}/P_{it_{j-1}} - 1$$

The log return is defined as:

$$r^*_{it_j} = \ln(P_{it_j}/P_{it_{j-1}})$$
In both these definitions of return we do not consider possible dividends. There exist
corresponding definitions of total return where, in the case a dividend $D_j$ is accrued
between times $t_{j-1}$ and $t_j$, the numerator of both ratios becomes $P_{t_j} + D_j$.
Moreover, here we do not apply any accrual convention to our returns, that is: we
just consider period returns and do not transform these on a, say, yearly basis.
It is to be noticed that, while $P_{t_j}$ means “price at time $t_j$”, $r_{t_j}$ is a shorthand for
“return between times $t_{j-1}$ and $t_j$”, so that the notation is not really complete and its
interpretation depends on the context. When needed, for clarity's sake, we shall specify
returns as indexed by the beginning and the end point of the time interval in which
they are computed, as, for instance, in $r_{t_{j-1};t_j}$.
The two definitions of return yield different numbers, for the same prices, when the
ratio between consecutive prices is far from 1.
Consider the Taylor formula for $\ln(x)$ for $x$ near 1:

$$\ln(x) = \ln(1) + (x-1) - (x-1)^2/2 + \dots$$

If we truncate the series at the first order term we have:

$$\ln(x) \cong 0 + x - 1$$

so that, if $x$ is the ratio between consecutive prices, for $x$ near one the two
definitions give similar values.
It is also clear that $\ln(x) \le x - 1$. In fact $x-1$ is equal and tangent to $\ln(x)$ in
$x = 1$ and above it everywhere else (the second derivative of $\ln(x)$ is negative, while it
should change sign for $\ln(x)$ to cross $x-1$ before or after the tangency point). This
implies that, if one kind of return is used and mistaken for the other, the approximation
errors shall all be of the same sign. We also see that the size of the error is increasing
roughly as $(x-1)^2$.
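A quick numerical sketch of this comparison (the price ratios are hypothetical): the gap between the linear return $x-1$ and the log return $\ln(x)$ is always non negative and grows roughly like the quadratic Taylor term.

```python
import math

# Compare the linear return r = x - 1 with the log return r* = ln(x),
# where x = P_t / P_{t-1} is the ratio of consecutive prices.
for x in [1.01, 1.05, 1.20, 1.50]:
    r_lin = x - 1
    r_log = math.log(x)
    err = r_lin - r_log           # always >= 0, since ln(x) <= x - 1
    approx = (x - 1) ** 2 / 2     # leading Taylor term of the error
    print(f"x={x:.2f}  r={r_lin:.4f}  r*={r_log:.4f}  "
          f"err={err:.5f}  (x-1)^2/2={approx:.5f}")
```

For $x = 1.01$ the two returns agree to four decimal places; for $x = 1.50$ they differ by about 0.09.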
In Finance the ratio of consecutive prices (sometimes called “total return” and
maybe corrected by taking into account accruals) is often modeled as a random variable
with an expected value very near 1. This implies that the two definitions shall give
different values with sizable probability only when the variance (or more in general
the dispersion) of the price ratio distribution is non negligible, so that observations
far from the expected value have non negligible probability. Since standard models in
Finance assume that variance of returns increases when the time between prices for
which the return is computed increases, this implies that the two definitions shall more
likely imply different values when applied to long term returns.
Why two definitions? The corresponding prices are the same and this implies that
both definitions, if not swapped by error, give us the same information.
The point is that each definition is useful, in the sense of making computations
simple, in different cases.
From now on, for simplicity, let us only consider times t and t − 1.
Let the value of a buy and hold portfolio, composed of $k$ stocks each in quantity
$n_i$, at time $t$ be:

$$\sum_{i=1..k} n_i P_{it}$$
It is easy to see that the linear return of the portfolio shall be a linear function of
the returns of each stock.

$$r_t = \frac{\sum_{i=1..k} n_i P_{it}}{\sum_{j=1..k} n_j P_{jt-1}} - 1
= \sum_{i=1..k}\frac{n_i P_{it-1}}{\sum_{j=1..k} n_j P_{jt-1}}\,\frac{P_{it}}{P_{it-1}} - 1 =$$

$$= \sum_{i=1..k} w_{it}(r_{it}+1) - 1
= \Big(\sum_{i=1..k} w_{it} r_{it} + \sum_{i=1..k} w_{it}\Big) - 1
= \sum_{i=1..k} w_{it} r_{it} + 1 - 1 = \sum_{i=1..k} w_{it} r_{it}$$

Where $w_{it} = \frac{n_i P_{it-1}}{\sum_{j=1..k} n_j P_{jt-1}}$ are non negative “weights” summing to 1 which represent
the percentage of the portfolio invested in the $i$-th stock at time $t-1$.
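As a small numerical check (all quantities hypothetical), the identity can be verified directly: the portfolio linear return computed from total values coincides with the weighted sum of component returns.

```python
# Verify that the linear return of a buy-and-hold portfolio equals the
# weighted sum of the components' linear returns. Numbers are hypothetical.
n = [10, 5, 20]                  # quantities held of each stock
p0 = [100.0, 50.0, 20.0]         # prices at time t-1 (known)
p1 = [105.0, 48.0, 22.0]         # prices at time t

v0 = sum(ni * pi for ni, pi in zip(n, p0))   # portfolio value at t-1
v1 = sum(ni * pi for ni, pi in zip(n, p1))   # portfolio value at t
r_portfolio = v1 / v0 - 1                    # direct definition

w = [ni * pi / v0 for ni, pi in zip(n, p0)]  # weights, sum to 1
r = [b / a - 1 for a, b in zip(p0, p1)]      # component linear returns
r_weighted = sum(wi * ri for wi, ri in zip(w, r))

assert abs(sum(w) - 1) < 1e-12
assert abs(r_portfolio - r_weighted) < 1e-12
```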
This simple result is very useful. Suppose, for instance, that you know at time $t-1$
the expected values for the returns between time $t-1$ and $t$. Since the expected value
is a linear operator (the expected value of a sum is the sum of the expected values;
moreover, additive and multiplicative constants can be taken out of the expected value)
and the weights $w_{it}$ are known, hence non stochastic, at time $t-1$, we can easily compute
the expected return of the portfolio as:

$$E(r_t) = \sum_{i=1..k} w_{it} E(r_{it})$$
Moreover, if we know all the covariances between $r_{it}$ and $r_{jt}$ (if $i = j$ we simply have
a variance) we can find the variance of the portfolio return as:

$$V(r_t) = \sum_{i=1..k}\sum_{j=1..k} w_{it} w_{jt}\, Cov(r_{it}; r_{jt})$$
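A minimal sketch of these two formulas with hypothetical weights, expected returns, and covariance matrix:

```python
import numpy as np

# Portfolio expected return and variance from component moments.
# All input numbers are hypothetical.
w = np.array([0.5, 0.3, 0.2])             # weights, known at t-1
mu = np.array([0.06, 0.04, 0.08])         # E(r_it)
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.02, 0.00],
                [0.00, 0.00, 0.09]])      # Cov(r_it, r_jt)

e_port = w @ mu          # E(r_t) = sum_i w_i E(r_it)
v_port = w @ cov @ w     # V(r_t) = sum_i sum_j w_i w_j Cov(r_it, r_jt)
print(e_port, v_port)
```

This is exactly the computation at the heart of the Markowitz mean/variance model mentioned below.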
For log returns this is not so easy. In fact we have:

$$r^*_t = \ln\Big(\frac{\sum_{i=1..k} n_i P_{it}}{\sum_{j=1..k} n_j P_{jt-1}}\Big)
= \ln\Big(\sum_{i=1..k}\frac{n_i P_{it-1}}{\sum_{j=1..k} n_j P_{jt-1}}\,\frac{P_{it}}{P_{it-1}}\Big)
= \ln\Big(\sum_{i=1..k} w_{it}\exp(r^*_{it})\Big)$$
The log return of the portfolio is not a linear function of the log (nor of the
linear) returns of the components. In this case assumptions on the expected values and
covariances of the returns of each security in the portfolio cannot be (easily) translated
into assumptions on the expected value and the variance of the portfolio return by
simple use of the basic “expected value of the sum” and “variance of the sum” formulas.
Think how difficult this could make it to perform any standard portfolio optimization
procedure as, for instance, the Markowitz mean/variance model.
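A tiny numerical illustration (hypothetical weights and returns) of the non-linearity: the exact portfolio log return differs from the naive weighted sum of the components' log returns.

```python
import math

# The portfolio log return is NOT the weighted sum of component log returns.
w = [0.6, 0.4]            # weights at t-1
r_log = [0.10, -0.05]     # component log returns r*_it

# exact portfolio log return: ln( sum_i w_i exp(r*_i) )
r_port = math.log(sum(wi * math.exp(ri) for wi, ri in zip(w, r_log)))
naive = sum(wi * ri for wi, ri in zip(w, r_log))  # wrong "linear" aggregation

print(r_port, naive)   # the two numbers differ (r_port is larger, by Jensen)
```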
Before going on with log returns we stress again an important point. All the computations given above suppose that prices at time $t-1$ are known, that is: non
stochastic; moreover, we suppose the investment is not changing between $t-1$ and $t$.
If this were not so, we could not make passages like:

$$E(r_t) = \sum_{i=1..k} w_{it} E(r_{it})$$

We should be satisfied by the almost useless

$$E(r_t) = \sum_{i=1..k} E(w_{it} r_{it})$$
This is because $w_{it}$ is a function of prices at time $t-1$, which would now be stochastic;
and, even if the prices at time $t-1$ were known, the change of investment between times
$t-1$ and $t$ would make the weights at $t$, as seen from $t-1$, stochastic. The same
problem arises for the computation of the variance.
A stochastic $P_{t-1}$, and/or a change of strategy, then, destroys the possibility of recovering the expected value and the variance of the portfolio linear return from the
expected values and the variances and covariances of the linear returns of the individual
securities.
Now, log returns.
These are much easier to use than linear returns when we aim at describing the
evolution of the price of a single security through time.
Suppose we observe the prices $P_{t_i}$ at times $t_1, \dots, t_n$; the log return between $t_1$ and $t_n$
shall be:

$$r^*_{t_1,t_n} = \ln\frac{P_{t_n}}{P_{t_1}}
= \ln\frac{P_{t_n}}{P_{t_{n-1}}}\frac{P_{t_{n-1}}}{P_{t_1}} = \dots
= \ln\prod_{i=2,\dots,n}\frac{P_{t_i}}{P_{t_{i-1}}}
= \sum_{i=2,\dots,n} r^*_{t_i}$$
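The telescoping identity can be checked numerically (the price path is hypothetical): the full-period log return is exactly the sum of the sub-period log returns.

```python
import math

# The log return over the whole period equals the sum of sub-period log returns.
prices = [100.0, 104.0, 99.0, 107.0, 110.0]   # hypothetical price path

r_total = math.log(prices[-1] / prices[0])
r_subs = [math.log(b / a) for a, b in zip(prices, prices[1:])]

assert abs(r_total - sum(r_subs)) < 1e-12
```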
It is then easy, for instance, given the expected values and the covariances of the
sub-period returns, to compute the expected value and the variance of the full period
return (from $t_1$ to $t_n$).
On the other hand, this is not true for the linear returns. We have:

$$r_{t_1,t_n} = \frac{P_{t_n}}{P_{t_1}} - 1
= \frac{P_{t_n}}{P_{t_{n-1}}}\frac{P_{t_{n-1}}}{P_{t_1}} - 1 = \dots
= \prod_{i=2,\dots,n}\frac{P_{t_i}}{P_{t_{i-1}}} - 1
= \prod_{i=2,\dots,n}(r_{t_i}+1) - 1$$
In general the expected value of a product is difficult to evaluate and does not
depend only on the expected values of the terms. A noticeable special case is that of
non correlation among terms. For the computation of the variance, the problem is even
worse.
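The multiplicative aggregation of linear returns can be sketched on the same hypothetical price path: the full-period linear return is a product of the $(1 + r)$ factors, not a sum.

```python
# Linear returns compound multiplicatively: the full-period linear return is
# the product of the (1 + r) factors minus 1. Hypothetical price path.
prices = [100.0, 104.0, 99.0, 107.0, 110.0]

r_total = prices[-1] / prices[0] - 1
growth = 1.0
for a, b in zip(prices, prices[1:]):
    growth *= b / a              # multiply the (1 + r) sub-period factors

assert abs(r_total - (growth - 1)) < 1e-12
```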
It is clear that, when problems involving the modeling of portfolio evolution over
time are considered, we shall often see promiscuous and inexact use of the two
definitions. You should keep in mind that standard "introductory" portfolio allocation
models are one period models; when we are considering multi-period portfolio models
the returns may be more difficult to use than simple prices.
To sum up: the two definitions of returns yield different values when the ratio
between consecutive prices is not equal to 1. The linear definition works very well for
portfolios over a single period and conditionally on the knowledge of prices at time $t-1$:
expected values and variances of portfolios can be derived from expected values, variances
and covariances of the components, as the portfolio linear return over a time period is
a linear combination of the returns of the portfolio components.
For analogous reasons the log definition works very well for single securities over
time.
We conclude this section with three warnings. These should be obvious but experience teaches the opposite.
First. Many other definitions of return exist and each one originates from either
traditional accounting practice (and typically is connected with some specific asset
class) or from specific computational needs. These are usually based on linear returns
but use different conventions for computing the number of days between two prices and
the accrual of possible dividends and coupons.
Second. No single definition is the “correct” or the “wrong” one. In fact such a
statement has no meaning. The correctness in the use of a definition depends on the
context in which it is applied (accounting uses are to be satisfied) and, obviously, on
avoiding naive errors like exponentiating linear returns for deriving prices or
summing log returns over different securities in order to get portfolio returns.
For instance: the fact that, for a price ratio near 1, the two definitions give
similar values should not induce the reader into the following consideration: “if I break a
sizable period of time into many short sub periods, such that prices at consecutive times
are likely to be very similar, I am going to make a very small error if I use, say, the
linear return in the accrual formula for the log return”. This is wrong: in any single sub
period the error is going to be small, but, as mentioned above, this error always has
the same sign, so that it shall sum up and not cancel, and on the full time interval the
total error shall be the same no matter how many sub periods we consider.
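This accumulation can be sketched numerically (the price path, from 100 to 200, is hypothetical). Summing linear sub-period returns, as if they were additive like log returns, lands near the total log return no matter how fine the subdivision, and so keeps missing the true full-period linear return by roughly the same amount:

```python
import math

# Treating linear returns as additive (as log returns are) gives a total error
# that does NOT shrink as the subdivision gets finer: each tiny per-period
# error has the same sign and they pile up to (about) the same total gap.
p0, pn = 100.0, 200.0
true_linear = pn / p0 - 1                 # the correct full-period linear return: 1.0
for n_sub in [10, 100, 1000]:
    x = (pn / p0) ** (1.0 / n_sub)        # equal price ratio per sub-period
    summed = n_sub * (x - 1)              # sum of linear sub-period returns
    print(n_sub, true_linear - summed)    # error stays near 1 - ln(2) ~ 0.307
```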
Third: this multiplicity of definitions requires that, when we speak about any
properties of “returns”, it should be made clear which return definition we have in
mind. For instance: the expected value of log returns must not be confused with the
expected value of linear returns. The probability distribution of log returns shall not
be the same as the probability distribution of linear returns, and so on.
Practitioners are very precise in specifying such definitions in financial contracts; the
common imprecision in financial newspapers can be justified in view of their descriptive
purposes. The same precision is not always found in academic papers.
[Figure: r and r* as functions of Pt/Pt−1]
1.2 Price and return data
Finance is “full of numbers”: price data and related statistics are gathered for commercial and institutional reasons and are readily available in free and commercial
databases. This has been true for many years and, for some relevant markets, databases
have been reconstructed back to the nineteenth century and in some cases even before.
As in any field where data are so overwhelmingly available and not directly created
by the researcher through experiments, any researcher must be cautious before using them
and follow at least some very simple rules which could be summarized in the sentence:
“KNOW YOUR DATA BEFORE USING IT!”.
What does the number mean? How was it recorded? Did it always mean the same
thing? These are three very simple questions which should get an answer before any
analysis is attempted. Failure to do so could taint results in such a way as to make
them irrelevant or even ridiculous.
Avoid any oversimplifying position the like of the surprising one (if you consider
the usual quality of thought) by Schumpeter quoted in section 14.
This is not the place for a detailed discussion but it could be useful for us to try
and analyze a very simple example.
Suppose you wish to answer the following question: “how did the US stock market
behave during its history”.
You browse the Internet and run a search for literature on the topic. Suppose you
are able to shunt off conspiracy theorists, finance fanatics, quack doctors and snake
oil sellers, Ponzi scheme advertising and the like.
Let us say that you concentrate on academic and academia-linked literature (which
by no means assures you of avoiding the peculiar positions just listed).
At the onset you could be puzzled by the fact that, in the overwhelming majority
of papers and books, the performance of markets where thousands of securities, not
always the same, are traded, and traded in different historical moments and under
different institutional rules, is summarized by a single number, an index. For the
moment we do not consider this point.
You find a whole jungle of academic and non academic references among which you
choose two frequently quoted expository books by famous academics: “Irrational
Exuberance” by Robert J. Shiller (of Yale) and “Stocks for the Long Run” by Jeremy J.
Siegel (of Wharton)⁷. You browse through the first chapter of both and find Figure
1-1 of Siegel which tells you that 1 dollar invested in stocks in 1802 would have become
7,500,000 dollars by 1997. Moreover you read that 1 dollar of 1802 is equivalent (according to Siegel) to 12 dollars in 1997. The return should then have been of about
625,000 times in real terms (62,500,000%!)
On the other hand, Figure 1.1 of Shiller’s book gives the following information:
between 1871 and 2000 the S&P composite index corrected for inflation grew from
(roughly) 70 to (roughly) 1400, with a real return of roughly 20 times (2000%).

⁷ The connection between the two authors and the two books is clearly stated by Shiller in his
Acknowledgments.
Both numbers are big, but also quite different.
Now you are puzzled.
Sure: a part of the difference is due to the different time basis.
Looking at Siegel’s picture you see that the dollar value of the investment around 1870
was about 200; even exaggerating inflation, attributing the full 12 times devaluation to
the 1870-2000 period, and assessing these 200 dollars to be worth 2400 1997 dollars, we would
have a real increase of 3125 times, which is still more than 150 times Shiller’s number.
This, obviously, cannot come from the difference in terminal years of the sample, as
the period 1997-2000 was a bull market period and should reduce, not increase, the
difference.
Now, both Authors are famous Finance professors and at least one of them (Shiller)
is one of the gurus of the present crisis. So the problem must be in the reader (us).
Let us try and improve our understanding by reading the details. First we notice that Siegel quotes as source for the raw data the Cowles series as reprinted in
Shiller's book “Market Volatility” for the 1871-1926 period and the CRSP data for
the following period, while Shiller speaks about the S&P composite index. Reading with care we see another difference: Shiller speaks about a “price” index while
Siegel about a reinvested-dividends total return index. Is this the trick? Browsing the Internet we see that Shiller’s data are actually available for downloading
(http://www.econ.yale.edu/∼shiller/data.htm). We can compute the total return for
Shiller's data between 1871 and 1997: the real increase now is from 1 dollar to 3654
dollars in real terms.
We also see that the CPI passed from 12 to 154 in the same time interval, so that the
“12 times” rule for the value of the dollar used by Siegel seems a good approximation⁸.
There is still some disagreement between the numbers (Siegel 3125, but with exaggerated inflation, and Shiller 3654) but we think that, at least for answering our
question, we have enough understanding.
In this very short and summary analysis we did learn some important things.
First: understand your question. “How did the US market behave during its history?”
is, we now understand, not quite a well specified question. Are we looking for a
summary of the history of prices, or for the history of one dollar invested in the market?
The two different questions have two different answers and require different data.
Second: understand your data. Price data? Total return data? Raw or inflation
corrected?

⁸ Beware of long term inflation indexes. The underlying hypothesis is that the basket of consumption
goods be comparable through time. As an anecdotal hint of the contrary: a very good riding horse in 1865
could cost 200 dollars; a “comparable status” car costs, today, 50,000 dollars. If we, quite heroically,
compare the two “goods” on the basis of their use, we see an increase in price not of 12 times but of 250 times.
If we use the “12 times” rule, we get 2,400 dollars, which might be the price of a scooter. Which is the
right comparison?
There are many subtle but relevant points that should be made; we only mention
the survivorship bias problem which taints the ex post use of financial series.
But we stop here for the moment and do not mention the fact that a lot of discussion
has taken place about the relevance of the questions and of the answers and their interpretation.
The fact is: Siegel and Shiller start with similar data but they reach quite different
conclusions (at least, this is their opinion on their work). We can reconcile the data:
we understand they are using the same data in two different ways. However, why does each
of them draw a different conclusion and, moreover, why do they “agree to disagree”?
1.3 Some empirical “facts”
While we realize, if not fully understand, these differences of opinion, this could be the
right place to state several empirical “facts” that underlie much of the discussion about
the long run behaviour of the US stock market.
We do this with the yearly Shiller dataset (widely used in the academic literature). We
shall concentrate on the total log return series. The dataset starts in 1871 and is
updated each year; since in the latest available version Shiller uses data up to
2013 included, we shall limit our computations to the interval 1871-2013.
During this interval the average real log total return of the index was 6.33%.
In the same period the average real one year interest rate was 1.03% so that the so
called risk premium was about 5.3%.
The standard deviation of the real log total return was 17.09% while the same
statistic for one year real interest rates was 6.54%.
The 5.3% average real log total return in excess of the yearly rate (which was even
higher up to year 2000) compared with the 17.09% standard deviation (even smaller
than this up to 2000) did generate a literature concerned with the “equity premium
puzzle”.
The average of the real dividend yield (up to 2011 only) is 4.45% and the standard
deviation of the same is 1.5%.
The average real log price return was 2.16% and the standard deviation of the same
17.68%.
While we can only approximately sum these two results and compare them with the
total real log return, we see that most of the equity premium is associated with the
dividend yield.
Notice that the correlation coefficient between the real dividend yield and the real log price
return is .10 (positive but small); this explains why the standard deviation of the total
real log return is even smaller than the sum of the standard deviations of the log real price
return and the real dividend yield. On the other hand this small correlation is, by itself, a
puzzle.
A last piece of simple data analysis: the 1 year autocorrelation of the real total log
return series is very small: 2.29%. This is a first simple piece of evidence of the fact that it is
very difficult to forecast future returns on the basis of past returns.
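A sketch of how such a 1-year (lag-1) autocorrelation is computed from a yearly return series. The data here are simulated with moments loosely in line with those quoted above, not Shiller's actual series:

```python
import numpy as np

# Lag-1 autocorrelation of a yearly log return series (simulated data).
rng = np.random.default_rng(0)
r = rng.normal(loc=0.063, scale=0.17, size=143)  # 143 hypothetical yearly log returns

# correlate the series with itself shifted by one year
autocorr = np.corrcoef(r[:-1], r[1:])[0, 1]
print(autocorr)   # near zero for a serially uncorrelated series
```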
Some of these empirical facts are at the basis of the simple stock price evolution
model we shall introduce in the next chapter.
[Siegel, FIGURE 1-1: Total Nominal Return Indexes, 1802-1997]
Examples
Exercise 1a - returns.xls Exercise 1b - returns.xls
2 Logarithmic random walk
The (naive) log random walk (LRW) hypothesis on the evolution of prices states that,
if we abstract from dividends and accruals, prices evolve approximately according to
the stochastic difference equation:

$$\ln P_t = \ln P_{t-\Delta} + \epsilon_t$$

where the “innovations” $\epsilon_t$ are assumed to be uncorrelated across time ($cov(\epsilon_t; \epsilon_{t'}) = 0\ \forall t \neq t'$), with constant expected value $\mu\Delta$ and constant variance $\sigma^2\Delta$. Sometimes
a further hypothesis is added and the $\epsilon_t$ are assumed to be jointly normally distributed.
In this case the assumption of non correlation becomes equivalent to the assumption
of independence.
Since $\ln P_t - \ln P_{t-\Delta} = r^*_t$, the LRW is obviously equivalent to the assumption
that log returns are uncorrelated random variables with constant expected value and
variance.
A specific probability distribution for $\epsilon_t$ is not required at this introductory level.
It is, however, the case that, often, the log random walk hypothesis is presented from
scratch assuming $\epsilon_t$ to be Gaussian, or normal. Notice that from the model assumptions
we have $P_t = P_{t-\Delta}e^{r^*_t} = P_{t-\Delta}e^{\epsilon_t}$, so, if $\epsilon_t$ is assumed Gaussian, $P_t$ shall be lognormally
distributed.
A linear (that is: without logs) random walk in prices was sometimes considered
in the earliest times of quantitative financial research, but it does not seem a good
model for prices since a sequence of negative innovations may result in negative prices.
Moreover, while the hypothesis of constant variance for (log) returns may be a good
first order approximation of what we observe in markets, the same hypothesis for prices
is not empirically sound: in general price changes tend to have a variance which is an
increasing function of the price level.
A couple of points to stress.
First: $\Delta$ is the “fraction of time” over which the return is defined. This may be
expressed in any unit of time measurement: $\Delta = 1$ may mean one year, one month,
one day, at the choice of the user. However, care must be taken so that $\mu$ and $\sigma^2$ are
assigned consistently with the choice of the unit of measurement of $\Delta$. In fact $\mu$ and $\sigma^2$
represent the expected value and variance of the log return over a horizon of length $\Delta = 1$
and they shall be completely different if 1 means, say, one year (as it usually does) or
one day (see below for a particular convention for translating the values of $\mu$ and $\sigma^2$
between different units of measurement of time, which is one of the consequences of the
log random walk model).
Second: suppose the model is valid for a time interval of $\Delta$ and consider what
happens over a time span of, say, $2\Delta$.
By simply composing the model twice we have:

$$\ln P_t = \ln P_{t-2\Delta} + \epsilon_t + \epsilon_{t-\Delta} = \ln P_{t-2\Delta} + u_t$$

having set $u_t = \epsilon_t + \epsilon_{t-\Delta}$. The model appears similar to the single $\Delta$ one, and in
fact it is, but it must be noticed that the $u_t$, while uncorrelated (due to the hypothesis
on the $\epsilon_t$) on a time span of $2\Delta$, shall indeed be correlated on a time span of $\Delta$. This
means, roughly, that the log random walk model can be aggregated over time if we
“drop” the observations (just one in this case) in between each aggregated interval (in
our example the model shall be valid if we drop every other original observation).
This is going to be relevant in what follows.
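A simulation sketch of this point: overlapping aggregated innovations $u_t$ share one $\epsilon$, so consecutive overlapping $u$'s have correlation about 0.5, while $u$'s built on non-overlapping pairs are uncorrelated.

```python
import numpy as np

# Aggregating the LRW over 2*Delta: u_t = e_t + e_{t-1}.
rng = np.random.default_rng(1)
e = rng.normal(size=100_000)          # iid innovations

u_overlap = e[1:] + e[:-1]            # u at every step: consecutive u share one e
u_disjoint = e[0::2] + e[1::2]        # u over non-overlapping pairs

c_overlap = np.corrcoef(u_overlap[1:], u_overlap[:-1])[0, 1]
c_disjoint = np.corrcoef(u_disjoint[1:], u_disjoint[:-1])[0, 1]
print(c_overlap, c_disjoint)          # about 0.5 and about 0.0
```

The 0.5 comes from $Cov(u_t, u_{t-\Delta}) = Var(\epsilon) $ divided by $Var(u_t) = 2 Var(\epsilon)$.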
The LRW was a traditional standard model for the evolution of stock prices. It is
obviously a wrong model, if understood as stating that prices are dictated by “chance”.
It can be considered a good descriptive model in the sense that its success depends
not on its interpretation of the actual process of price creation (there it would fail miserably)
but on its consistency with observed “large scale” (i.e. $\Delta$ not too small) statistical
properties of prices. Consistency is measured by comparing probabilities of events as
given by the model with observed frequencies of such events.
Even from this point of view, while the model is not dramatically wrong and is still
useful for introductory and simple purposes, the weight of empirical analysis during
the last thirty years has led most researchers to consider this hypothesis as a very
approximate description of stock price behavior.
While no consensus has been reached on an alternative standard model, there is
a general agreement about the fact that some sort of (very weak) dependence of today’s
returns on the full, or at least recent, history of returns exists. Moreover, the
constancy of the expected value and variance of the innovation term has been strongly
questioned.
In any case, the LRW still underlies many conventions regarding the presentation
of market statistics. Moreover the LRW is perhaps the most important justification
for the commonly held equivalence between the intuitive term "volatility" and the
statistical entity "variance" (or better "standard deviation").
An important example of this concerns the “annualization” of expected value and
variance.
We are used to the fact that, often, the rate of return of an investment over a given
time period is reported in an “annualized” way. The precise conversion from a period
rate to a yearly rate depends on accrual conventions. For instance, for an investment
of less than one year in length, the most frequent convention is to multiply the period
rate by the ratio between the (properly measured according to the relevant accrual
conventions) length of one year and the length of the investment. So, for instance,
if we have an investment which lasts three months and yields a rate of 1% in these
three months, the rate on a yearly basis shall be 4%.
It is clear that this is just a convention: the rate for an investment of one year
in length shall NOT, in general, be equal to 4%; this is just the annualized rate for
our three months investment. The two would coincide, for instance, if the term structure
of interest rates were constant. However such a convention can be useful for comparison
across investment horizons.
In a similar way, when we speak of the expected return or the standard deviation/variance of an investment, it is common to report the number in an annualized way
even if we speak of returns for periods of less or of more than one year. The actual
annualization procedure is based on a convention which is very similar to the one used
in the case of interest rates. As in that case, the convention is “true”, that is: annualized
values of expected value and variance correspond to per annum expected values and
variances, only in particular cases. The specific particular case on which the convention
used in practice is based is the LRW hypothesis.
If we assume the LRW and consider a sequence of $n$ log returns $r^*_t$ at times $t, t-1, t-2, \dots, t-n+1$ (just for the sake of simplicity in notation we suppose each time
interval $\Delta$ to be of length 1 and drop the generic $\Delta$) we have that:

$$E(r^*_{t-n,t}) = E\Big(\sum_{i=0,\dots,n-1} r^*_{t-i}\Big) = \sum_{i=0,\dots,n-1} E(r^*_{t-i}) = n\mu$$

$$Var(r^*_{t-n,t}) = Var\Big(\sum_{i=0,\dots,n-1} r^*_{t-i}\Big) = \sum_{i=0,\dots,n-1} Var(r^*_{t-i}) = n\sigma^2$$
This obvious result, which is a direct consequence of the assumptions of constant
expected value and variance and of non correlation of innovations at different times, is
typically applied, for annualization purposes, also when the LRW is not considered to
be valid.
So, for instance, given an evaluation of $\sigma^2$ on daily data, this evaluation is annualized
by multiplying it, say, by 256 (or any other number representing open market days; different
ones exist); it is put on a monthly basis by multiplying it by, say, 25 and on a weekly
basis by multiplying it by, usually, 5.
As we stressed before, this is not just a convention but the correct procedure if
the LRW model holds. In this case, in fact, the variance over $n$ time periods is equal to
$n$ times the variance over one time period. If the LRW model is not believed to hold,
for instance if the expected value and/or the variance of returns is not constant over
time or if we have correlation among the $\epsilon_t$, this procedure can still be applied, but just as
a convention.⁹
⁹ Empirical computation of variances over different time intervals typically results in sequences
which tend to increase less than linearly with respect to the increase of the time interval between consecutive
observations. This could be interpreted as the existence of (small) on average negative correlations
between returns.
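The annualization convention can be sketched in a few lines. The daily figures below are hypothetical and 256 trading days are assumed, as in the text:

```python
import math

# Annualization under the LRW convention: the expected value and the variance
# scale linearly with time, the standard deviation with the square root.
mu_daily = 0.0003        # hypothetical daily expected log return
sigma_daily = 0.012      # hypothetical daily standard deviation

mu_annual = 256 * mu_daily                   # n * mu
var_annual = 256 * sigma_daily ** 2          # n * sigma^2
sigma_annual = math.sqrt(256) * sigma_daily  # sqrt(n) * sigma
print(mu_annual, var_annual, sigma_annual)
```

Note that the standard deviation is multiplied by $\sqrt{256} = 16$, not by 256: this asymmetry between the scaling of $\mu$ and of $\sigma$ is what drives the “time diversification” discussion below.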
The fact that, under the LRW, the expected value grows linearly with the length of
the time period while the standard deviation (square root of the variance) grows with
the square root of the number of observations, has generated a lot of discussion about the
existence of some time horizon beyond which it is always proper to hold a stock portfolio. This problem, conventionally called “time diversification”, and more popularly
“stocks for the long run”, has been discussed at length both on the positive side (commonly
sustained by fund managers) and the negative side (more rooted in academia: Paul
Samuelson was a non negligible opponent of the idea); we shall consider it in the next
section.
To get an idea of the empirical implications of the LRW hypothesis (plus that of a
Gaussian distribution) for returns, we plot in the following figures an aggregated index
of the US stock market in the 20th century together with 100 simulations describing
possible alternate histories of the US market in the same period, under the hypothesis that the index evolution follows a LRW with yearly expected value and standard
deviation of log return identical to the historical average and standard deviation:
resp. 5.36% and 18.1% (the use of %, as usual with log returns, is quite improper, if
common). Data is presented both in price scale (starting value 100) and in log price
scale. The reason is simple. Consider the distribution of the log return after 100 years under
our hypothesis. This is going to be the distribution of the sum of 100 iid Gaussian RVs
each with expected value 5.36% and standard deviation 18.1%. Using known results
we have that this distribution shall be Gaussian with expected value 536% and standard deviation 181%. So, a standard $\pm2\sigma$ interval for the terminal value of this sum
is $536\% \pm 362\%$, or, in price terms, $100e^{5.36\pm3.62}$, that is an interval with lower extreme
569 and upper extreme 794263. This means that under our hypotheses the possible
histories can be quite different. No problem in this if we recall the unconditional nature
of the model.
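The interval quoted in the text can be reproduced directly from the stated parameters:

```python
import math

# +/- 2 sigma interval for the terminal price: summed log return after 100
# years is Gaussian with mean 5.36 and standard deviation 1.81 (536%, 181%).
mean, sd, p0 = 5.36, 1.81, 100.0

lower = p0 * math.exp(mean - 2 * sd)   # about 570
upper = p0 * math.exp(mean + 2 * sd)   # about 794,000
median = p0 * math.exp(mean)           # about 21,000: the median, not the mean
print(round(lower), round(upper), round(median))
```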
To get a quick idea: the actual evolution of the market as measured by our index
gave a final value equal to about 21000, which corresponds, as said, to a sum of log
returns of 536%. This is, by construction, smack in the middle of the distribution of
the summed log returns and is the median of the price distribution. However, due
to the exponentiation, or, if you prefer, due to the power of compound interest, the
distribution of final values is highly asymmetric (it is lognormal) so that the range
of possible values above the median of prices is much bigger than that below it. We
only simulated 100 possible histories. Even with such a limited sample we have a top
terminal price of more than 2000000 (in a very lucky, for long investors, world. We
wonder what studying Finance would be in such a world...) and a bottom terminal
price below 100 (again: in a world so unlucky that, had we lived in it, we likely would
not talk about the stock market)¹⁰.
¹⁰ Compare this with the Siegel-Shiller data we discussed in section 1, then think about the result of
our simulation in such extreme worlds. For instance, with the historical mean and standard deviation
of the extreme depressed version of the 20th century, the simulation I would show you in this possible world,
This result could be puzzling as the “possible histories” seem very heterogeneous.
This is an immediate consequence of the log random walk hypothesis. If we estimate
µ and σ out of a long series of data (one century) we are using data from a very
heterogeneous set of economic and historical conditions. Then we use this number in
order to simulate “possible histories” without conditioning to any particular evolution
of historical or economical variables which could and shall influence the stock price.
In other words: we are using the log random walk model as a “marginal” model.
That is: it is unconditional to everything you may know or suppose about the evolution
of other variables connected with the evolution of the modeled stock price.
This point is quite relevant if we wish to understand the sometimes surprising
implications of this simple model.
In the above example, according to the model and the historically estimated parameters, we get the ±2σ interval 536%±362% (beware the % sign: these are log
5.36±3.62
returns), or, in price terms, 100e.
that is an interval with lower extreme 569 and
11
upper extreme 794263 . It must be clear that such a wide set of histories is possible,
with non negligible probability, only because we did assume nothing on the (century
long) evolution of all the variables that shall influence prices. Only under this “ignorance” assumption such an heterogeneous set of trajectories can have non negligible
probability.
If we are puzzled by the result this is because, while the model describes the possible
evolution of prices “in whatever conditions”, unconditional to anything (in fact, we
estimate expected return and standard deviation using a long history, during which
many different things happened), when we see the implications of the model we, almost
invariably, shall be conditioned by our recent memories and recall recent events or,
unconsciously, shall make some hypothesis on the future as, for instance, the fact that
provided you an I were still interested in this topic, would be quite different that what you see here.
And all the same, this possible story is a result totally compatible (under Gaussian LRW) with what
we did actually see in our real history. Spend a little time thinking about this point. It could be
“illuminating”.
Think also to the economic sustainability of such extreme worlds: such extreme market behaviours
cannot happen by themselves (this is not the plot of some lucky or unlucky casino guy, it is the market
value of an economy, which should sustain such values, provided investors are not totally bum) and
how they could be so absurd just because they underline the possible absurd extreme conclusions we
can derive from a simple LRW model.
Last but not least, remember that all this comes from the analysis of the stock market in a very,
up to now, successful country: the USA. But we analyze it so much also because it was successful
(and so, for instance, most Finance schools, journals and researchers are USA based. This biases our
conclusions if we wish to apply such conclusions to the rest of the world or, even, to the future of
USA. Maybe a more balanced view could be gained by comparing this result with the evolution of
stock markets all around the world (this is not a new idea, Robert J. Barro, for instance did this in
“Rare Disasters and Asset Markets in the Twentieth Century.” (2006) Quarterly Journal of Economics,
121(3): 823–66.)
11
By the way: this should be enough to understand why we should not use the term % when
speaking of log returns.
30
economic growth shall be, on average, similar to what we have recently seen. Since the estimates of µ and σ we use (or even our assumption of zero correlation of log returns and, more in general, the structure of the model itself, which contains no other variables but a single price) are NOT conditional on such (implicit) hypotheses, it is not surprising that the model gives us such wide variation bounds with respect to what we could expect. This misunderstanding is quite common and it is to be always kept in mind when discussing results of the applications of the log random walk model12.
12. There exists a wide body of literature, both from the applied and the academic sides, that suggests ways for “conditioning” the model. This shall not be considered in this course; however, in appendix 20 we consider a simple version of a possible conditional model.
[Figure: 100 years of simulated log random walk data, 100 simulated paths (mean log return 5.35%, st. dev. 18.1%)]

[Figure: 100 years of simulated log random walk data (range subset), compared with the USA stock market in the 20th century (mean log return 5.35%, st. dev. 18.1%)]

[Figure: 100 years of simulated log random walk data, log scale, compared with the USA stock market in the 20th century (mean log return 5.35%, st. dev. 18.1%)]
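The simulated paths in the figures above can be reproduced with a few lines; a sketch assuming iid Gaussian yearly log returns with the estimated parameters (seed and path count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)   # assumption: any seed works
mu, sigma = 0.0536, 0.181
n_years, n_paths = 100, 100

# iid Gaussian yearly log returns; cumulating them gives the log price path
log_returns = rng.normal(mu, sigma, size=(n_paths, n_years))
log_prices = np.log(100) + np.cumsum(log_returns, axis=1)
prices = np.exp(log_prices)      # price scale, starting value 100

print(prices[:, -1].min(), np.median(prices[:, -1]), prices[:, -1].max())
```

Plotting `prices` (or `log_prices`) against years reproduces the qualitative picture: extremely heterogeneous terminal values in price scale, roughly parallel bands in log scale.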
2.1 "Stocks for the long run" and time diversification
These are very interesting and popular topics, part of the lore of the financial milieu. A short discussion shall be useful to clarify some issues connected with the LRW hypothesis, together with some implicit assumptions underlying much financial advertising.
We have three flavors of the “stocks for the long run” argument. The first and the second are a priori arguments depending on the log random walk hypothesis or something equivalent to it; the third is an a posteriori argument based on historical data.
It is quite important to have a clear idea of the different weight and meaning of these arguments. In fact, most of the “puzzling” statements you may find in investment industry advertising of “stocks for the long run” depend on a wrong “mix” of the arguments.13
2.1.1 First version
The basic idea of the first version of the argument can be sketched as follows. Suppose single period (log) returns have (positive) expected value µ and variance σ². Moreover, suppose for simplicity that the investor requires a Sharpe ratio of, say, S out of his-her investment. Under the above hypotheses, plus the log random walk hypothesis, the Sharpe ratio over n time periods is given by

nµ / (√n σ) = √n µ/σ
so that, if n is big enough, any required value can be reached. Another way of phrasing the same argument, when we add the hypothesis of normality on returns, is that, if we choose any probability α, the probability of the investment yielding an n period return greater than

nµ − √n z_{1−α} σ

is equal to 1 − α. But this, for

√n > (1/2) z_{1−α} σ/µ

is an increasing and unbounded (above) function of n, so that for any α and any chosen value C, there exists an n such that from that n onward, the probability of an n period return less than C is less than α.
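The √n growth of the multi-period Sharpe ratio can be illustrated numerically; a sketch with illustrative (not estimated) values of µ and σ:

```python
from math import sqrt

mu, sigma = 0.05, 0.30  # illustrative per period values, sigma/mu = 6

def sharpe(n):
    # n-period Sharpe ratio under the LRW: n*mu / (sqrt(n)*sigma) = sqrt(n)*mu/sigma
    return n * mu / (sqrt(n) * sigma)

# doubles each time n quadruples
print([round(sharpe(n), 3) for n in (1, 4, 25, 100)])  # → [0.167, 0.333, 0.833, 1.667]
```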
The investment suggestion could be: if your time horizon is an undetermined number n of years, then choose the investment that has the highest expected return per unit of standard deviation, even if the standard deviation is very high. Even if this investment may seem too risky in the "short run", there is always a time horizon such that, for that horizon, the probability of any given loss is as small as you like or, which is the same, the Sharpe ratio is as big as you like. Typically, such high return (and high volatility) investments are stocks, so: "stocks for the long run".

13. As an example of rather clever misunderstandings, read the Vanguard document “Time Diversification and Horizon-Based Asset Allocations”, available at http://www.vanguard.com/pdf/icrtd.pdf?2210045172
Notice, however, that the value of n for which this lower bound crosses a given C level is the solution of

nµ − √n z_{1−α} σ ≥ C

In particular, for C = 0 the solution is

√n ≥ z_{1−α} σ/µ

With the typical stock, the σ/µ ratio for one year is of the order of about 6. So, even allowing for a big α, so that z_{1−α} is near one (check by yourself the corresponding α), the required n shall be in the range of 36, which is only slightly shorter than the average working life.
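This back-of-the-envelope computation can be sketched as follows; the values µ = 5%, σ = 30% are hypothetical, chosen only so that σ/µ = 6 as in the text:

```python
from math import ceil
from statistics import NormalDist

def horizon_for_no_loss(mu, sigma, alpha):
    # smallest n with sqrt(n) >= z_{1-alpha} * sigma / mu, i.e. the horizon
    # beyond which the alpha-quantile of the n period log return is >= 0
    z = NormalDist().inv_cdf(1 - alpha)
    return ceil((z * sigma / mu) ** 2)

# hypothetical "typical stock" with sigma/mu = 6; alpha big enough that
# z_{1-alpha} is near one (alpha about 0.16)
print(horizon_for_no_loss(0.05, 0.30, 0.16))
```

A smaller (more demanding) α pushes the required horizon up quickly.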
It is important to understand that, by itself, we cannot judge such a statement as
correct or wrong.
This investment suggestion is or is not reasonable depending on the investor’s criterion of choice. This, for instance, could be the full period expected return given some
probability of a given loss, or the Sharpe ratio for the full n periods or, for instance, the
per period Sharpe ratio (which obviously is a constant) or, again, the absolute volatility
over the full period of investment (which obviously increases without bounds), and so
on.
For instance, a typical critique to the statement is phrased like this: "Why should
we consider as proper a given investment for n time periods if we do not consider
it proper for each single one of those periods?" This critique is correct if we believe
that the investor takes into account the per period Sharpe ratio or some measure of
probable loss and expected return per period. In other words, the critique is correct if, very reasonably, we believe the investor does not consider equivalent investments with identical Sharpe ratios but over different time spans.
Another frequent critique is: "It is true: the expected value of the investment increases without bounds, but so does its volatility, so, in the end, over the long run I am, in absolute terms, much more uncertain about my investment result" (the mean-standard deviation ratio goes up only because the numerator grows faster than the denominator). This is reasonable as a critique if we believe the investor decides on the basis of the absolute volatility of the investment over the full time period.
We should also point out that to choose a single asset class only because, by itself, it
has the highest Sharpe ratio, should always be criticized on the basis of diversification
arguments.
In the end, acceptance or refusal, on an a priori basis, of this argument depends on how we model the investor’s decision making. However, in general, it cannot be labeled as “wrong”: there may be a point in it, at least if you are a very peculiar kind of investor.
2.1.2 Second version
The second version of the argument, again based on the log random walk hypothesis,
is a real fallacy (that is: it is impossible to justify it in any reasonable way) and is the
so called "time diversification" argument.
There is an enticing similarity, under the log random walk hypothesis, between an investment for one year in, say, 10 uncorrelated securities with identical expected returns and volatilities (this last hypothesis is just for simplicity: the argument can be extended to different expected returns and volatilities), and a 10 year investment in a single security with the same expected value and volatility.
To be precise, in order for the result to hold we must forget the difference between
linear and log returns, moreover the comparison implicitly requires zero interest rates.
But let’s do it: such an “approximate” way of thinking is very common in any field
where some Mathematics is used for practical purposes and it is a sound way to proceed
provided the user is able to understand the cases where his-her “approximations” do
not work.
In this case, the expected return and standard deviation for the return corresponding to the first strategy (which could be tagged as the "average per security" return) are µ and σ/√n, just the same as the expected value and standard deviation for the "average per year" return of the second strategy.
We should be wary from the beginning in accepting such comparisons: in fact the
two investments cannot be directly compared since they are investments of the same
amount but on different time periods.
Moreover, but this is not independent from the previous comment, the comparison
is based on the flawed idea that the expected return and variance of the first investment,
can be compared with the average per year expected return and variance of the second
investment. In fact, while the expected return and variance of the first investment are
properties of an effective return distribution (that is the distribution of a return which
I could effectively derive from an investment) the average expected return and variance
of the second investment are not properties of a return which I could derive from the
second investment.
All that I can derive from the second investment is the distribution of returns over
the ten years period which, obviously, has ten times the expected value and ten times
the variance than the distribution of the average return (which, we stress again, is not
the return I could get by the investment).
So, no time diversification exists but only a wrong comparison between different
investments using different notions of returns.
Comparable investments could be a ten year investment in the diversified portfolio
and a ten year investment in the single security and a possible correct comparison
criterion could be the comparison between the ten year expected return and return
variance of the two investments. However, in this case the diversified investment is seen to yield the same expected value as the undiversified investment but with one tenth of the variance, so that these two investments, now comparable, are by no means equivalent, and the single security investment is seen, in the mean variance sense, as an inferior one.
Analogously, we could ask which investment on a single security over ten years has
the same return mean and variance as the one year diversified investment. The obvious
answer is: an investment of one tenth the size of the diversified investment. In other words: in order to have the same effective (that is: you can get it from an investment) return distribution, the two investments must be not only on different time periods but also of different sizes.
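A small Monte Carlo sketch of this comparison (parameter values, seed and sample sizes are arbitrary choices): the per-year average of the ten year investment matches the one year diversified investment in mean and standard deviation, but the effective ten year return has ten times the mean and one hundred times the variance of the effective one year diversified return.

```python
import numpy as np

rng = np.random.default_rng(1)   # assumption: any seed works
mu, sigma, n = 0.05, 0.20, 10    # illustrative per period parameters
m = 200_000                      # Monte Carlo replications

# Strategy A: one year, equal weights in n uncorrelated securities
a = rng.normal(mu, sigma, size=(m, n)).mean(axis=1)
# Strategy B: n years in a single security (log returns sum over time)
b = rng.normal(mu, sigma, size=(m, n)).sum(axis=1)

# per-year average of B matches A in mean and standard deviation...
print(a.mean(), a.std(), (b / n).mean(), (b / n).std())
# ...but the effective n year return of B has n times the mean and
# n^2 times the variance of A's effective one year return
print(b.mean() / a.mean(), b.var() / a.var())
```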
While the first version of the argument could be argued for, at least under some
hypothetical, maybe unlikely but coherent, setting, this second version of the argument
is a true fallacy.
2.1.3 Third version
The third version of the stocks for the long run argument is the soundest, as it can be argued for without relying on unlikely assumptions or even blatant logical errors.
It is to be noticed that this third version is not an a-priori argument, based on
assumptions concerning the stochastic behavior of prices and the decision model of
agents (and, maybe some logical error). Instead, it is an "a posteriori" or "historical"
version of the argument. As such its acceptance or rejection entirely depends on the
way we study historical data.
In short this argument states that, based on the analysis of historical prices, stocks
were always, or at least quite frequently, a good long run investment.
Being an historical argument, even if true (and here is not the place to argue for or against this point), it does not imply that the past behavior should replicate in the future.
The following figure from “Stocks for the long run” summarizes Siegel’s argument: on the horizontal axis we see the holding periods; on the vertical axis the up and down bars plot the maximum and minimum holding period total return (in real terms) for all possible holding periods of the same length. This is plotted for stocks, bonds (medium/long term govies) and t-bills (short term govies). Returns are expressed in average per year terms. As can be seen (and as is quite obvious) the range between best and worst average returns shrinks as the holding period increases. What is less obvious is that the worst mean return for stocks is less negative than the corresponding value for bonds and bills from the 13 year holding period onward, while the best mean return is always higher.
Actually, the figure could be misleading as it compares returns for different securities on different time periods. Much more interesting could be a figure which compares the spreads over the same periods of investments in stocks vs bonds, bonds vs bills and stocks vs bills and displays, for these three kinds of investments, the best and worst average result for each holding period. A possible help can be derived from another figure in “Stocks for the long run”. In this picture we see the standard deviation of the average return for different holding periods (this time, supposedly, non overlapping) of investments in the three securities. These are compared with the theoretical standard deviation for the investment in stocks derived from the hypothesis of iid log returns (our log random walk). As commonly observed, the actual standard deviation of mean returns decreases faster than 1/√n, the order implied by the random walk hypothesis, and this may imply a slight negative correlation between returns which becomes relevant for long holding periods. Both of these are empirical facts. Many different interpretations are possible and, as already stated above, it is perhaps better
to consider them just as historical facts, without trying to deduce “general rules” or “models” which, while easily adapted to past data, would most likely be very fragile statements about possible futures.
While apparently held by the majority of financial journalists (provided they do not weight too much, say, the last 30 years of prices in Japan or the last 10 to 15 years for most of the rest of the world), and broadly popular in trouble free times (at least as popular as the, historically false, argument about real estate as the surest, if not the best, investment), and so quite popular for most time periods, at least in the USA and during the first thirty and the last fifty years of the past century, this argument is quite controversial among researchers.
The two very famous and quite readable books we quoted in the chapter about
returns: Robert Shiller’s "Irrational Exuberance" vs Jeremy Siegel’s "Stocks for the
Long Run" share (sic!) opposite views on the topic (derived, as we hinted at but do
not have the time to fully discuss, from different readings of the same data).
While this is not the place for discussing the point, we would suggest that the reader, just for the sake of amusement, consider a basic fault of such "in the long run it was ..." arguments.
We have a typical example of the case where the fact itself of considering the argument, or even the phenomenon itself to which the argument applies, depends on the fact that the phenomenon happened, that is: something "was good in the long run".
In fact we could doubt the possibility for an institution (the stock market) which has survived in its modern form, at least in the USA, since, say, the second half of the nineteenth century, to survive up to today without at least giving a sustainable impression of offering some opportunities.
Such arguments, if not accompanied by something else to sustain them, become somewhat empty, as would the analogous surprise at observing that the dish I most frequently eat is also among those I like the most or, more extremely, that old people did not die young or, again, that when we are in one of many queues, we spend most time in the slowest queue.
Sometimes, however, the "opportunity" of some institution, and how to connect this with its survival, can manifest in strange, revealing ways. For instance, games of chance have existed from time immemorial with the only "long run" property of making the bank holder richer, together with the occasional random lucky player. The overall population of players is made, as a whole, poorer. So, while it is clear here what the "opportunity" of this institution is (both the, usually, steady enrichment of the bank holder and the available, albeit unlikely, hope of a quick enrichment), the survival of such an institution based on such opportunities tells us something interesting about man's mind.
We shall get into this topic time and again in what follows (while we won't be able to analyze it in full). This should not puzzle the reader as it is the bitter bread
and butter of any research field where we decide to use Probability and Statistics for
writing and testing models, but only observational data are available and no (relevant)
experiments are possible. Let us mention some of these fields: evolutionary biology,
cosmology, astronomy.
In all these fields we are overflowed by data (as in Finance and Economics) but the data does not come from experiments and, most important, the observer is part of the dataset and observation is not independent of what is observed.
A possible alternative, actually chosen by similar fields like history, is to abandon, or not even consider a serious possibility, the writing of models in the language of Probability and the testing of these with Statistics. In such fields, Statistics is still used: not as a tool for testing models but as a tool for describing historical data. Fields like political science and sociology are divided in their attitude.
If we like fringe movements, there exists a minority of historians, mostly inspired by “Chicago area” new economic history or cliometrics, publishing in Economics journals and not very well considered by mainstream historians, who try to deal with historical problems using Probability and Statistics (usually adapting models from Economics). On the other side, a not small number of economists believe that the mainstream attitude to Economics shows an excess in the use of such tools, and state that there exists useful economic knowledge which cannot be expressed in any available mathematical/probabilistic language. In extreme cases the statement is made that only irrelevant points of Economics can be described with such tools14.
2.2 *Some further considerations about log and linear returns
The log return has many uses beyond the fact that it sums over time.
Consider the following “game”.
Your initial wealth is W_0. At each time t you flip a coin with probability P of head and 1 − P of tail. If head comes up W_{t+1} = W_t u, otherwise W_{t+1} = W_t d.
To fix the ideas set P = .5, u = 1.5 and d = .6.
This seems a good game, at least at first sight.
Compute E(W_1/W_0|W_0): this is equal to (u + d)/2 = 1.05. Now compute
E(W_2/W_0|W_0) = (uu + ud + du + dd)/4 = (2.25 + .9 + .9 + .36)/4 = 1.1025
In general, we have

E(W_n/W_0|W_0) = E(u^(n_u) d^(n−n_u)) = d^n E((u/d)^(n_u))

where n_u is the (random) number of heads (on n flips).
Since n_u is Binomial(n, P) we can approximate its distribution with a N(nP, nP(1 − P)).
Write now (u/d)^(n_u) = exp(n_u ln(u/d)). Using the CLT approximation, n_u ln(u/d) is (approximately) distributed according to N(nP ln(u/d), nP(1 − P) ln(u/d)²).
We then have that exp(n_u ln(u/d)) is Log-normal with expected value exp(nP ln(u/d) + ½ nP(1 − P) ln(u/d)²).
Putting this together we find:

E(W_n/W_0|W_0) ≈ d^n exp(nP ln(u/d) + ½ nP(1 − P) ln(u/d)²) = exp(nP ln(u/d) + ½ nP(1 − P) ln(u/d)² + n ln(d))

With our numbers this is exp(n · 0.05229) > 1 and unbounded.
This approximation is quite good even for n = 1 (1.05368) and n = 2 (1.11024).
Notice that, with this computation, we are considering the “average across trajectories”, that is, e.g. for n = 2, we have 4 possible trajectories, with final results W0
times 2.25, .9, .9, .36 and we are considering the expected value across these.
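Both the exact expected value and the lognormal approximation can be checked numerically; a short sketch:

```python
from math import comb, exp, log

P, u, d = 0.5, 1.5, 0.6

def exact_mean(n):
    # E(W_n/W_0): sum over the binomial number of heads n_u
    return sum(comb(n, k) * P ** k * (1 - P) ** (n - k) * u ** k * d ** (n - k)
               for k in range(n + 1))

def lognormal_approx(n):
    # the CLT / lognormal approximation derived in the text
    r = log(u / d)
    return exp(n * (P * r + 0.5 * P * (1 - P) * r ** 2 + log(d)))

for n in (1, 2, 10):
    print(n, exact_mean(n), lognormal_approx(n))
```

Note that the exact mean collapses to (Pu + (1 − P)d)^n = 1.05^n, which makes the unbounded growth obvious.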
14. For some short remarks about the debate on Mathematics and Economics see the appendix on page 205.
Let us look at the game in a different way:

W_n/W_0 = (u/d)^(n_u) d^n = exp(r_u n_u + r_d (n − n_u))

where r_u = ln(u) and r_d = ln(d) are the log returns corresponding to u, d.
Log returns sum over time. We have, then,

E(ln(W_n/W_0)|W_0) = E(r_u n_u + r_d (n − n_u)) = n(r_u P + r_d (1 − P))

With our numbers E(ln(W_n/W_0)|W_0) = −0.05268 n.
The fact that the two results are different is, by itself, not striking: the first expected
value, positive and unbounded, is the “arithmetic average” of the possible linear returns
across different “histories”, the second is the average of the possible log returns of the
same.
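Both per-period quantities can be verified in a couple of lines; a quick sketch:

```python
from math import log

P, u, d = 0.5, 1.5, 0.6
# expected linear growth factor per period: greater than 1...
print(P * u + (1 - P) * d)            # 1.05
# ...but the expected log return per period is negative
print(P * log(u) + (1 - P) * log(d))  # about -0.05268
```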
We could agree that, since what has monetary meaning is not log but linear return,
we should only consider the positive and not the negative result.
Before commenting further on this, however, there is a second, quite interesting result.
Take a story of length n; at the end of this story you have more money than at the beginning iff

W_n/W_0 = (u/d)^(n_u) d^n > 1

Let us compute the probability of this event:

P((u/d)^(n_u) d^n > 1) = P(n_u ln(u/d) + n ln(d) > 0) = P(n_u/n > −ln(d)/ln(u/d))
Here, again, we use the Gaussian approximation and recall that, approximately, n_u/n ≈ N(P, P(1 − P)/n).
With this we have:

P(n_u/n > −ln(d)/ln(u/d)) = P(√n (n_u/n − P)/√(P(1 − P)) > √n (−ln(d)/ln(u/d) − P)/√(P(1 − P))) ≈ 1 − Φ_{0,1}(√n (−ln(d)/ln(u/d) − P)/√(P(1 − P)))

Using our numbers this becomes 1 − Φ_{0,1}(√n · 0.1148) and this goes to 0 as n grows.
In words: the probability of being on a trajectory where I “win at n” goes to 0 as n increases. The bigger n, the less likely it is that, if I choose a trajectory of length n at random, I win.
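Under the same Gaussian approximation, the probability of winning at n can be sketched as:

```python
from math import log, sqrt
from statistics import NormalDist

P, u, d = 0.5, 1.5, 0.6
threshold = -log(d) / log(u / d)  # fraction of heads needed to end above W_0

def win_prob(n):
    # Gaussian (CLT) approximation of P(W_n/W_0 > 1)
    z = sqrt(n) * (threshold - P) / sqrt(P * (1 - P))
    return 1 - NormalDist().cdf(z)

for n in (2, 10, 100, 1000):
    print(n, round(win_prob(n), 4))
```

The probability decays monotonically toward zero as the game gets longer.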
Let us summarize:
1. In this game if n is big enough, I am almost sure to lose.
2. However, if I take the expected value of the percentage increase of wealth (linear return), from the beginning to n, across trajectories, this is positive and increases unboundedly with n.
How is this possible?
The answer is simple: on most trajectories you lose (you see this even if n = 2: you lose in 3 out of 4 trajectories). However, IF YOU WIN YOU MAY WIN BIG: there are trajectories whose probability goes to 0, but along which you win big.
Since the percentage amount won along these few lucky trajectories grows with n much faster than their probability decreases, if you compute the arithmetic mean of the possible terminal results this is positive and unbounded.
If a big population plays this game and the coin toss results are independent across
different players, we shall see that most players are going to lose their money, while
few (the fewer the bigger n, that is, the longer is the game) win big.
If the game is the solo source of income of this population, we should observe a
growing and extreme concentration of wealth.
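A small simulation of a finite population of players shows exactly this pattern (population size, horizon and seed are arbitrary choices):

```python
import random

random.seed(3)                    # assumption: any seed works
P, u, d, n, N = 0.5, 1.5, 0.6, 50, 20_000

finals = []
for _ in range(N):
    w = 1.0
    for _ in range(n):
        w *= u if random.random() < P else d
    finals.append(w)

finals.sort()
mean = sum(finals) / N            # pulled up by a few huge winners
median = finals[N // 2]           # the typical player ends well below 1
losers = sum(f < 1 for f in finals) / N
print(round(mean, 2), round(median, 4), losers)
```

Most of the total wealth ends up concentrated in the small minority of lucky trajectories.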
If the population of gamblers is finite, however, for big n the probability of observing even just 1 winner goes to 0. If the size of the population is constant and equal to N, the expected number of winners, N·P(n_u/n > −ln(d)/ln(u/d)), quickly becomes smaller than 1.
The expected log return, on the other hand, is negative and unbounded below
because it only grows linearly with the number of wins and not exponentially like the
linear return. The trajectories on which you “win” are the same, both if you measure
your winnings in log return and linear return terms. Obviously, also the probability of
each trajectory is the same as it only depends on the number of heads and tails not on
how you compute the return.
However, as we know, the linear return is always bigger than the log return, except in the case of zero return. This means that, on positive trajectories, the linear return shall grow much faster, and on negative trajectories it shall fall more slowly, than the log return.
This difference is such that the same (trajectory) probability, multiplied by the
linear return of the trajectory and summed over trajectories, shall be positive and
unbounded, while, multiplied by the log return of the same trajectory (that is: the
log of one plus the linear return) and summed over trajectories, shall be negative and
unbounded.
Would you play this game? If you are risk neutral you should: you are only interested in the expected “monetary” (linear) return, and you are not worried by the fact that the most likely result is that you lose.
If you are risk averse (e.g. you have a log utility) you should not.
Historical note: this example is a reworked version of the famous “St. Petersburg
paradox” by Nicholas Bernoulli, considered and “solved” (it would be better to say:
“understood”) by his cousin Daniel Bernoulli while working for the Czar in St. Petersburg.
The result is a “paradox” only in the sense that it makes clear the implications of
being risk neutral. If we are puzzled by these implications, we can deduce that we are
NOT, by instinct, risk neutral.
We do not think it reasonable to willingly play a game where we are almost sure
to lose (for big n), just because the expected win in the game is positive (and even
unbounded).
The “solution” by Daniel Bernoulli is the understanding of this: that we are puzzled
by the “paradox” because we are not risk neutral “by instinct”.
If you evaluate your win, still using the expected value, not in linear return terms
(a win of k times the original sum is evaluated k) but in log return terms (a win of k
times the original sum is evaluated ln(k)), you introduce a “risk averse” utility function
and would not play the game.
But there is more: once this is understood, a deeper “paradox” becomes clear.
As computed above, the expected value of W_n/W_0 is greater than 1 and unbounded. On the other hand, the probability of being on a non losing trajectory tends to 0. This means, as stated above, that the probability that in a finite population of players at least one player wins tends to zero.
If we put these three statements together we immediately understand that, even if the expected value of W_n/W_0 is greater than 1 and unbounded, with probability which tends to 1 as n increases, all players shall be losers.
This means that for almost all players the realized W_n/W_0 shall be less than 1.
Suppose, now, that the players do not know the expected value of the game and need to estimate it. After all, we can evaluate the expected value because we suppose we “know” the probabilities.
Almost all players and, in the limit, a set of players of probability one, shall observe W_n/W_0 < 1: why should they believe that the expected value of this is greater than one and increasing with n?
This example shall make clear the difference between:
1) computing the expected value of the game using ALL possible (in the limit infinite) trajectories, each with its probability;
2) computing the average result of a FINITE set of players.
In 1) we consider all trajectories. Some of these, a set of small and decreasing (with n increasing) probability, imply enormous winnings which “balance” the much more likely losses.
In 2) we only have a finite population and, for big n, the probability of observing even a single “winner” in this population goes to 0.
Last point: the game we discussed is by no means strange. It is the classic recombining (multiplicative) binomial tree presented in basic option pricing courses.
Examples
Exercise 2 - IBM random walk.xls
3 Volatility estimation
In applied Finance the term “volatility” has many connected meanings. We mention
here just the main three:
1. Volatility may simply mean the attitude of market prices, rates, returns etc. to change in an unpredictable and unjustified manner, without connection to any formal definition of “change”, “unpredictable” or “unjustified”. Here volatility is tantamount to chance, luck, destiny, etc. Usually the term has a negative undertone and is mainly used in bear markets. In bullish markets the term is not frequently used and is typically replaced by a more “positive” synonym: a volatile bull market is “exuberant”, “tonic” or “lively”.
2. More formally, and mostly for risk managers, volatility has something to do with the standard deviation of returns and, sometimes, is estimated using historical data (hence the name “Historical Volatility”).
3. For derivative traders, and frequently for risk managers, “volatility” is the name of one (or more) parameters in derivative models which, under the hypotheses that make the models “true”, are connected with the standard deviation of underlying variables. However, in the understanding that these hypotheses are never valid in practice, such parameters are not estimated from historical data on the underlying variables (say, using time series of stock returns) but backed out directly from quoted prices of derivatives, using the pricing model as a fitting formula. This is in accord with the strange, but widely held and, in fact, formally justifiable, notion that models may be useful even if the hypotheses underlying them are false. This is “Implied Volatility”.
In what follows we shall introduce a standard and widely applied method for estimating
volatility on the basis of historical data on returns, that is, we consider the second
meaning of volatility.
Under the LRW hypothesis a sensible estimate of σ 2 is:
$$\hat{\sigma}^2 = \sum_{i=0,\dots,n}\left(r^*_{t-i}-\bar{r}^*\right)^2/n$$

Where $\bar{r}^*$ is the sample mean.
This is the standard unbiased estimate for the variance of uncorrelated random variables with identical expected values and variances (the simple empirical variance of the data, where the denominator is taken as the actual number of observations n + 1, could be used without problems, as in standard applications the sample size is quite big).
Notice that each data point is given the same weight: the hypothesis is such that
any new observation should improve the estimate in the same way.
The log random walk would justify such an estimate.
In practice, nobody uses such an estimate; a common choice is the exponential smoothing estimate which, while already quite old when suggested by J. P. Morgan in the RiskMetrics context, is commonly known in the field as the RiskMetrics estimate:

$$V_t = \frac{\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-i}}{\sum_{i=0,\dots,n}\lambda^i}$$
From a statistician’s point of view this is an exponentially smoothed estimate with λ
a smoothing parameter: 0 < λ < 1.
Common values of the smoothing parameter are around 0.95.
Users of such an estimate do not consider it sensible to treat each data point as equally relevant: old observations are less relevant than new ones.
Implicitly, then, while we “believe” the log random walk when “annualizing” volatility, we do not believe it when estimating volatility.
Moreover it shall be noticed that, in this estimate, the sample mean of returns does not appear. This is a choice which can be justified in two ways. First, we can assume the expected return µ over a small time interval to be very small. With a non negligible variance it is quite likely that an estimate of the expected value of returns could show a higher sampling variability than its likely size and so it could create problems for the statistical stability of the variance estimate^15. Second, an estimate of the variance where the expected value is set to 0 tends to overestimate, not to underestimate, the variance (remember that the variance equals the mean of squares less the squared mean: if you set the second to 0 you exaggerate the estimate). For institutional investors,
traditionally long the market, this could be seen as a conservative estimate. Obviously
this may not be a reasonable choice for hedged investors and derivative traders.
15 A simple “back of the envelope” computation: say the standard deviation for stock returns over one year is in the range of 30%. Even in the simple case where data on returns are i.i.d., if we estimate the expected return over one year with the sample mean we need about 30 observations (years!) in order to reduce the sampling standard deviation of the mean to about 5.5%, so as to be able to estimate reliably risk premia (this is financial jargon: the expected value of return is commonly called ’risk premium’, implying some kind of APT, even if it also contains the risk free rate) of the size of at least (usual 2σ rule) 8%-10% per year (quite big indeed!). Notice that things do not improve if we use monthly or weekly or daily data (why?). It is clear that any direct approach to the estimate of risk premia is doomed to failure. A connected argument shall be considered at the end of this chapter.
The apparent truncation at n should be briefly commented. As we have just seen
the standard estimate should be based on the full set of available observations. This
could be applied as a convention also to the RiskMetrics estimate. On the other hand
consider the fact that, e.g., λ = 0.95 raised to the power of 256 (conventionally one year of daily data) is less than 0.000002. So, at least with daily data, truncating the sum after one year of data (or even before) is substantially the same as considering the full data set.
As is well known:

$$\sum_{i=0}^{N}\lambda^i = \frac{1-\lambda^{N+1}}{1-\lambda}$$

So that (for 0 < λ < 1)

$$\sum_{i=0,\dots,\infty}\lambda^i = 1/(1-\lambda)$$
We can then approximate the Vt estimate as:
$$V_t = (1-\lambda)\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-i}$$
In order to understand the meaning of this estimate it is useful to write it in a recursive
form (this is also useful for computational purposes). We can directly check that:
$$V_t = \lambda V_{t-1} + \frac{r^{*2}_t}{\sum_{i=0,\dots,n}\lambda^i} - \frac{\lambda^{n+1}r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n}\lambda^i}$$

In fact, since

$$V_{t-1} = \frac{\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n}\lambda^i}$$

We have

$$V_t = \lambda\frac{\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n}\lambda^i} + \frac{r^{*2}_t}{\sum_{i=0,\dots,n}\lambda^i} - \frac{\lambda^{n+1}r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n}\lambda^i} =$$

$$= \frac{\sum_{i=0,\dots,n}\lambda^{i+1}r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n}\lambda^i} + \frac{r^{*2}_t}{\sum_{i=0,\dots,n}\lambda^i} - \frac{\lambda^{n+1}r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n}\lambda^i} =$$

$$= \frac{r^{*2}_t + \sum_{i=0,\dots,n-1}\lambda^{i+1}r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n}\lambda^i} = \frac{\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-i}}{\sum_{i=0,\dots,n}\lambda^i}$$

Which is the definition of $V_t$.
For the standard range of values of λ and n the last term can be approximated with 0^16. Using the approximate value of the denominator we have:

$$V_t = \lambda V_{t-1} + (1-\lambda)r^{*2}_t$$
In practice the new estimate Vt is a weighted mean of the old estimate Vt−1 (weight
λ, usually big) and of the latest squared log return (weight 1 − λ, usually small).
A simple consequence of this (and of the fact that the estimate does not consider the mean return) is the following. Since the squared return is always non negative and λ is usually near one, this formula implies that, even if the new return is 0, Vt is still going to be equal to λVt−1, so that the estimated variance can decrease at most by a fraction 1 − λ. On the other hand, it can increase, in principle, by any amount when abnormally big squared returns are observed. This implies an asymmetric behavior: any shock introduces an abrupt jump in Vt, while a subsequent sequence of “normal” values for returns reduces the estimated value only in a smoothed way, the faster the smaller is λ. The reader should remember that this behavior of estimated volatility is purely a feature of the formula used for the estimate.
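The recursion and its asymmetric behavior can be sketched in a few lines of Python (an illustration only: the starting variance and the returns below are made-up numbers, while the update rule itself is the one in the text):

```python
def riskmetrics_update(v_prev, r, lam=0.95):
    """One step of the approximate RiskMetrics recursion:
    V_t = lam * V_{t-1} + (1 - lam) * r_t**2 (zero-mean returns)."""
    return lam * v_prev + (1 - lam) * r * r

# Asymmetric behaviour: a single big return makes V jump at once, while
# a run of zero returns only shrinks it by a factor lam per step.
v = 0.0001                       # hypothetical starting daily variance
v = riskmetrics_update(v, 0.05)  # shock: a 5% return
print(v)                         # abrupt upward jump
for _ in range(10):
    v = riskmetrics_update(v, 0.0)
print(v)                         # decayed only by lam**10 after ten calm days
```

The jump up is immediate, the decay back is geometric: exactly the asymmetry described above.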
The use of such an estimate of σ² implies a disagreement with the standard version of the LRW hypothesis, as described above, as it implies a time evolution of the variance of returns. The recursive formula:

$$V_t = \lambda V_{t-1} + (1-\lambda)r^{*2}_t$$

is the empirical analogue of an autoregressive model for the variance of returns, the like of:

$$\sigma^2_t = \gamma\sigma^2_{t-1} + \epsilon^2_t$$

which is a particular case of a class of dynamic models for conditional volatility (ARCH: Auto Regressive Conditional Heteroskedastic) of considerable fortune in the econometric literature.
The above discussion, involving the smoothed estimate for the return variance, is
by no means just a fancy theoretical analysis or a curiosity related to RiskMetrics. It is
the basis of current regulations. Here I reproduce a paragraph of the EBA (European
Banking Authority) paper EBA/CP/2015/27.
Article 38 Observation period
1. Where competent authorities verify that the VaR numbers are computed using an effective historical observation period of at least one year, in
16 $\lambda^{n+1}r^{*2}_{t-n-1}\big/\sum_{i=0,\dots,n}\lambda^i \simeq (1-\lambda)\lambda^{n+1}r^{*2}_{t-n-1}$, and for 0 < λ < 1, big n and any squared return “not too big”, this shall be approximately 0.
accordance with point (d) of Article 365(1) of Regulation (EU) No 575/2013,
competent authorities shall verify that a minimum of 250 business days is
used. Where institutions use a weighting scheme in calculating their VaR,
competent authorities shall verify that the weighted average time lag of the
individual observations is not less than 125 business days.
2. Where, according to point (d) of Article 365(1) of Regulation (EU)
No 575/2013 the calculation of the VaR is subject to an effective historical
observation period of less than one year, competent authorities shall verify
that the institution has in place procedures to ensure that the application
of a shorter period results in daily VaR numbers greater than daily VaR
numbers computed using an effective historical observation period of at least
one year.
The quoted Article 365(1) of Regulation (EU) No 575/2013 (on prudential requirements for credit institutions and investment firms) is as follows:
Article 365
VaR and stressed VaR Calculation
1. The calculation of the value-at-risk number referred to in Article 364
shall be subject to the following requirements:
(a) daily calculation of the value-at-risk number;
(b) a 99th percentile, one-tailed confidence interval;
(c) a 10-day holding period;
(d) an effective historical observation period of at least one year except
where a shorter observation period is justified by a significant upsurge in
price volatility;
(e) at least monthly data set updates.
The institution may use value-at-risk numbers calculated according to
shorter holding periods than 10 days scaled up to 10 days by an appropriate
methodology that is reviewed periodically.
2. In addition, the institution shall at least weekly calculate a “stressed
value-at-risk” of the current portfolio, in accordance with the requirements
set out in the first paragraph, with value-at-risk model inputs calibrated to
historical data from a continuous 12-month period of significant financial
stress relevant to the institution’s portfolio. The choice of such historical
data shall be subject to at least annual review by the institution, which shall
notify the outcome to the competent authorities. EBA shall monitor the
range of practices for calculating stressed value at risk and shall, in accordance with Article 16 of Regulation (EU) No 1093/2010, issue guidelines
on such practices.
The meaning of this rule is that, if you use the exponentially smoothed estimate truncated at N, so that your (daily data) weights are

$$w_{t-i} = \lambda^i\Big/\sum_{j=0}^{N}\lambda^j = \lambda^i\frac{1-\lambda}{1-\lambda^{N+1}},$$

it must be that the “weighted average time lag of the individual observations”, that is $\sum_{i=0}^{N} i\,w_{t-i}$, be at least 125 (days). This, for given N, requires a specific choice of λ.
Notice that if N = 250 the only possible choice is λ = 1 (with equal weights on 251 observations the weighted average time lag is exactly 125). In order to decrease λ and respect the rule you must increase N^17.
Since for N → ∞ the weighted average time lag is λ/(1 − λ), the requirement asks in any case (that is: whatever be N) for a value λ ≥ 125/126 = .992063. An even bigger number shall be needed for moderate N. This is much bigger than what used to be the common case in the past. The examples in the “classic” edition of the RiskMetrics Technical Document (iv edition, 1996) use λ = .94 which, even with very big N, corresponds to a weighted average time lag of .94/.06 = 15.(6), by far too small according to the new rules.
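The interplay between λ, N and the 125-day bound can be explored numerically; a Python sketch (an illustration, not part of the course material) using the truncated weights defined above:

```python
def mean_lag(lam, N):
    """Weighted average time lag sum_i i*w_i for truncated exponential
    weights w_i = lam**i / sum_j lam**j, with j = 0..N."""
    weights = [lam ** i for i in range(N + 1)]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total

# With N -> infinity the mean lag tends to lam/(1-lam); the EBA rule
# (mean lag >= 125) then forces lam >= 125/126, about 0.992.
print(mean_lag(0.94, 2000))        # about 15.67: RiskMetrics' classic choice
print(mean_lag(0.992063, 5000))    # close to the 125-day bound
print(0.992063 / (1 - 0.992063))   # the limit value lam/(1-lam)
```

A direct check of this kind makes it obvious why λ = .94 fails the rule by an order of magnitude.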
3.1 Is it easier to estimate µ or σ²?
It is useful to end this small chapter discussing a widely held belief, supported by some empirical results, according to which the estimation of variances (and to a lesser degree of covariances) is an easier task than the estimation of expected returns, at least in the sense that the percentage standard error in the estimate shall be smaller than in the case of expected return estimation.
The educated heuristics underlying such a belief are as follows^18.
Consider log returns from a typical stock, let them be iid with expected value (on a yearly basis) of .07 and standard deviation .3. The usual estimate of the expected value, that is the arithmetic mean, shall be unbiased and with a sampling standard deviation of $.3/\sqrt{n}$ where n is the number of years used in the estimation. Hence the t-ratio, that is the ratio of the estimate to its standard error, under these hypotheses, shall be roughly $\sqrt{n}/4$. Hence, for a t-ratio of 2 we need $\sqrt{n} = 8$, that is n = 64 (years!). If we want a standard error equal to 1/4 of the estimate, so that the usual 2σ error bound is 1/2 of the estimate, we need $\sqrt{n} = 16$ and n = 256. 256 years of data for a 2σ confidence bound that still implies a possible error of 50% in the estimate of µ.
This simple back of the envelope computation explains why we know so little about expected returns: if our a priori beliefs are correct then it is very difficult to estimate them.
There could be a way out. Do not use yearly data but, say, monthly data.
Alas, for log returns and under log random walk this does not work.
Keep n constant and use any k sub periods per year (of length 1/k in yearly terms)
17 This can easily be done with Excel or similar. There is also a partially explicit solution. In fact, using some algebra, we see that the required “weighted average time lag” is $E(i) = \sum_{i=0}^{N} i\lambda^i\frac{1-\lambda}{1-\lambda^{N+1}} = \frac{\lambda}{1-\lambda} - \frac{(N+1)\lambda^{N+1}}{1-\lambda^{N+1}}$. The problem becomes that of choosing λ and N such that this value is at least equal to 125.
18 This point is discussed in many papers and book chapters. Among the most illustrious examples, see Appendix A in: Merton, R.C., 1980. “On estimating the expected return on the market: an exploratory investigation”. J. Financ. Econ. 8, 323–361.
such that the number of observations in n years (for returns over the sub periods) is
kn. The strategy could be that of estimating the sub period expected value µk = µ/k
(the equality is due to the log random walk hypothesis) and then get an estimate of
the yearly expected value by multiplying the sub period estimate by k. If we indicate with $r^*_{ki}$ the log returns for the sub periods, with $\sigma^2_k = \sigma^2/k$ as variance, we would have:

$$V(\hat{\mu}_k) = V\left(\sum_{i=1}^{kn} r^*_{ki}\Big/kn\right) = \sigma^2_k/kn = \sigma^2/k^2 n$$
This seems much better than before, but it is an illusion: we do not need an estimate of $\mu_k$, we need an estimate of $\mu = k\mu_k$. We must then compute $V(k\hat{\mu}_k)$ and this is

$$V(k\hat{\mu}_k) = k^2 V(\hat{\mu}_k) = \sigma^2/n$$
Exactly the same as with “aggregated” data. This should not surprise us: in fact the arithmetic mean of log returns is simply given by the log of the ratio of the last to the first price divided by the number of data points. In other words it only changes because of the denominator: n for a yearly mean and kn for a mean over sub periods of length 1/k. No information is added by using sub period data, hence no improvement in the variance of the estimate.
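The telescoping argument is easy to verify on made-up prices; a Python sketch (the price path below is invented for the sake of the example):

```python
import math

# Hypothetical price path: 12 "monthly" prices over one "year".
prices = [100, 103, 99, 104, 108, 105, 110, 112, 109, 115, 118, 120, 123]

# Monthly log returns...
monthly = [math.log(b / a) for a, b in zip(prices, prices[1:])]
# ...telescope: their sum is just the log of last over first price.
assert abs(sum(monthly) - math.log(prices[-1] / prices[0])) < 1e-12

# Yearly mu estimate from monthly data: k * (sum / kn) = log(P_T / P_0) / n,
# identical to the estimate built from yearly data alone.
k, n = 12, 1
mu_hat_k = sum(monthly) / (k * n)
print(k * mu_hat_k, math.log(prices[-1] / prices[0]) / n)  # same number
```

Only the first and last prices matter, so sampling more often adds no information about µ.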
In summary: the expected return is difficult to estimate for two reasons.
First: σ is expected to be much bigger than µ and the t-ratio depends on the ratio
of these.
Second: even if we increase the frequency of observations nothing changes for the
estimate of the (yearly) µ so that its sampling variance stays the same.
Now let us do a similar analysis for the variance. In order to make things simple we
shall suppose that µ is known and data are Gaussian. This allows us to find quickly
some useful results.
The general case (unknown µ) is given below, but nothing relevant intervenes when we remove the two simplifying hypotheses.
At the end of this section we shall also consider the smoothed estimate.
Let us compute the sampling variance of our variance estimate (known µ) and let $r^*_i$ be the yearly log return:

$$V(\hat{\sigma}^2) = V\left(\sum_{i=1}^{n}\frac{(r^*_i-\mu)^2}{n}\right) = \frac{1}{n}V((r^*-\mu)^2) = \frac{1}{n}\left(E((r^*-\mu)^4)-E((r^*-\mu)^2)^2\right) = \frac{1}{n}(\mu^c_4-\sigma^4)$$
Where $\mu^c_4 = E((r^*-\mu)^4)$ is the fourth centered moment and, without further hypotheses, could be any non negative constant. If the $r^*_i$ are Gaussian we have $\mu^c_4 = 3\sigma^4$ and the resulting variance of the sampling variance is

$$V\left(\sum_i\frac{(r^*_i-\mu)^2}{n}\right) = \frac{2\sigma^4}{n}$$
So that the sampling standard deviation of the estimated variance shall be, with our numbers, $\sqrt{2}\sigma^2/\sqrt{n}$.
The t-ratio for this estimate shall be (in the approximation we suppose the estimate and the true variance not to differ too much)

$$\frac{\hat{\sigma}^2\sqrt{n}}{\sqrt{2}\sigma^2} \approx .7\sqrt{n}$$

In order to get a t-ratio equal to 2 we need $\sqrt{n} > 2/.7$ and we get there with just n = 9 instead of 64 as in the expected value case. For a t-ratio of 4 or greater we need $\sqrt{n} > 4/.7$ and for this n = 33 suffices (instead of n = 256 for the above discussed case of the expected value).
But there is much more: for estimating the variance the use of higher frequency
data improves the result.
Let our strategy be that of estimating the yearly variance as k times the estimated variance for a sub period of length 1/k in yearly terms (the prime sign is to indicate that this is a new estimate):

$$\hat{\sigma}'^2 = k\hat{\sigma}^2_k$$
Using the same notation and hypotheses as above we get

$$V(\hat{\sigma}^2_k) = V\left(\sum_{i=1}^{kn}\frac{(r^*_{ki}-\mu_k)^2}{kn}\right) = \frac{1}{kn}V((r^*_k-\mu_k)^2) = \frac{1}{kn}\left(E((r^*_k-\mu_k)^4)-E((r^*_k-\mu_k)^2)^2\right) =$$

$$= \frac{1}{kn}(\mu^c_{k4}-\sigma^4_k) = \frac{2}{kn}\sigma^4_k = \frac{2}{kn}\frac{\sigma^4}{k^2}$$

where we used the Gaussian hypothesis. Then

$$V(\hat{\sigma}'^2) = k^2 V(\hat{\sigma}^2_k) = \frac{2}{kn}\sigma^4$$
And we see that now k is in the formula: using sub period data improves the estimate.
Now the t-ratio for the variance estimate is approximately equal to

$$\frac{\hat{\sigma}'^2\sqrt{kn}}{\sqrt{2}\sigma^2} \approx .7\sqrt{kn}$$

So that the use of k sub periods per year has an effect identical to that of multiplying the number of years by k. With, say, monthly data, we need less than one year (actually 9 months) for a t-ratio of 2 and slightly less than 3 years (just 33 months) for the t-ratio of the variance to become greater than 4, instead of the 33 years needed with yearly data^19.
19 We see that if we decrease the observation interval, so that the frequency of observations per unit period k increases, in the limit we get a sampling standard deviation of the variance equal to zero. This should not be taken too seriously: the log random walk model, which underlies this result, may be a good approximation for time intervals which are both not too long and not too short. Below the 1 day horizon we enter the world of intraday, trade by trade data, which cannot be summarized in the simple log random walk hypothesis.
Estimating σ² is then easier than estimating µ for two reasons.
First: the ratio (sometimes called t-ratio) between the estimate and its standard deviation is bigger than that for the expected value, whatever be the n. This comes from the empirical, and theoretical, idea that expected returns are much smaller than volatilities.
Second: even if the first reason was not true, we still have the fact that using higher
frequency data improves (dramatically) the quality of the estimate for σ 2 while it is
irrelevant for the estimate of µ.
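Both points can be checked by simulation; a Python sketch of a small Monte Carlo experiment with the illustrative parameters of the text (µ = .07, σ = .3, iid Gaussian log returns; the simulation sizes are arbitrary choices):

```python
import math
import random

random.seed(1)

mu, sigma, n_years, n_sim = 0.07, 0.30, 10, 2000

def estimates(k):
    """Yearly mu and sigma^2 estimates built from k sub periods per year
    (known-mu variance estimate, as in the text)."""
    m, s = mu / k, sigma / math.sqrt(k)
    r = [random.gauss(m, s) for _ in range(k * n_years)]
    mu_hat = k * sum(r) / len(r)
    var_hat = k * sum((x - m) ** 2 for x in r) / len(r)
    return mu_hat, var_hat

def sd(values):
    avg = sum(values) / len(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))

yearly = [estimates(1) for _ in range(n_sim)]
monthly = [estimates(12) for _ in range(n_sim)]

# mu: same sampling spread at both frequencies (about sigma/sqrt(n))
print(sd([e[0] for e in yearly]), sd([e[0] for e in monthly]))
# sigma^2: spread shrinks by roughly sqrt(12) with monthly data
print(sd([e[1] for e in yearly]), sd([e[1] for e in monthly]))
```

The first pair of numbers is essentially identical, the second differs by a factor near $\sqrt{12}$, as the formulas above predict.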
As mentioned above all our formulas hold for known µ and Gaussian log returns.
For the general case we have the following result:
with i.i.d. log returns, not necessarily Gaussian, for the estimate

$$S^2 = \sum_{i=1}^{n}(r^*_i-\bar{r}^*)^2/(n-1)$$

we get

$$\mathrm{Var}(S^2) = \frac{\mu^c_4}{n} - \frac{\sigma^4}{n}\frac{(n-3)}{(n-1)}$$
which, for not too small n and a fourth centered moment not very different from
the Gaussian case, gives us the same result as the above formula.
Notice that in all these cases the sampling variance of the estimate of the variance
(as that of the estimate of the expected value) goes to 0 with n going to infinity.
Let us conclude with the case of the smoothed estimate.
We are going to use the approximation for the denominator given by $\sum_{i=0,\dots,n}\lambda^i \approx \sum_{i=0,\dots,\infty}\lambda^i = 1/(1-\lambda)$. The variance of the smoothed estimate is

$$V(V_t) = V\left((1-\lambda)\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-i}\right) = (1-\lambda)^2\sum_{i=0,\dots,n}\lambda^{2i}V(r^{*2}_{t-i}) = \frac{(1-\lambda)^2}{1-\lambda^2}(\mu_4-\mu_2^2) = \frac{1-\lambda}{1+\lambda}2\sigma^4$$

where the last equality is true if the expected value is zero (as assumed in RiskMetrics) and log returns are Gaussian (and recall: $1-\lambda^2 = (1+\lambda)(1-\lambda)$).
Here it is meaningless to compare this with the quality of the estimate for µ because
this is assumed equal to zero.
It is, however, interesting to compare the result with one based on sub periods of length 1/k. Everything depends on the choice of λ for the sub periods. If we set it to $\lambda^{1/k}$ we have

$$V_{kt} = (1-\lambda^{1/k})\sum_{i=0,\dots,kn}\lambda^{i/k}r^{*2}_{kt-i}$$

so that, following the same steps as for $V(V_t)$:
$$V(V_{kt}) = 2\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\frac{\sigma^4}{k^2}$$

and we have, for the estimate $V'_t = kV_{kt}$ (we use the prime sign because this is different with respect to the estimate using aggregated data)

$$V(kV_{kt}) = k^2\,2\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\frac{\sigma^4}{k^2} = 2\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\sigma^4$$

This, for 0 < λ < 1 and k > 1, is always smaller than the variance computed using only full period data (k = 1)^20.
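The size of the reduction is easy to compute; a Python sketch of the factor $(1-\lambda^{1/k})/(1+\lambda^{1/k})$ for a few values of k (λ = 0.95 is just an illustrative choice):

```python
# Factor multiplying 2*sigma**4 in the sampling variance of the smoothed
# estimate: (1 - lam**(1/k)) / (1 + lam**(1/k)). It shrinks as k grows,
# since lam**(1/k) moves toward 1.
lam = 0.95
for k in (1, 4, 12, 256):
    x = lam ** (1.0 / k)
    print(k, (1 - x) / (1 + x))
```

The printed factors decrease monotonically in k, confirming that sub period data reduce the sampling variance of the smoothed estimate.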
Examples
Exercise 2 - volatility.xls Exercise 3 - risk premium.xls Exercise 3a - exp smoothing.xls
Exercise 3b - historical and implied volatility.xls Exercise 3c - volatility.xls
4 Non Gaussian returns
It can be argued that a reasonable decision maker should be interested in the probability distribution of returns implied by the strategy the decision maker chooses. This should be true even if, in common academic analyses of decision under uncertainty, the use of polynomial utility functions tends to overweight the role of the moments of the return distribution and in particular of the expected value and variance.^21
In some cases, as for instance in the Gaussian case, the simple knowledge of expected value and variance is equivalent to the knowledge of the full probability distribution. In this case the expected value of any utility function shall only depend on the expected value and the variance of the distribution (these being the only parameters of the distribution).
Another way to say the same is that, in this case, if we are interested in the probability with which a random variable X can show values less than or equal to a given value k, it is enough to possess the tables of the standard Gaussian cumulative distribution function and compute:
20 Notice that, with the smoothed estimate, the sampling variance of the estimate does not go to 0 for n going to infinity. On the other hand the bigger k the smaller the sampling variance. Remember, however, as said above, that k cannot be taken as big as you wish, as the log random walk hypothesis becomes untenable for very short time intervals between observations.
21 Due to the linearity of the expected value, the expected value of a polynomial utility function (that is, a linear combination of powers of the relevant variable) is a weighted sum of moments: $E(\sum_i\alpha_i X^i) = \sum_i\alpha_i E(X^i)$.
$$\Phi\left(\frac{k-\mu}{\sigma}\right)$$
Notice that, for distributions characterized by more than two parameters, as for instance a non standardized version of the T distribution, this property is obviously no longer valid.
It is then of real interest to find good distribution models for stock returns and, in particular, to evaluate whether the simplest and most tractable model, the Gaussian distribution, can do the job.
A better understanding of the problem can be achieved if we consider that, in
most applications, we are not interested in the overall fit of the Gaussian distribution
to observed returns but only in the quality of fit for hot spots of the distribution,
mainly tails. In Finance the biggest losses are usually connected to extreme, negative,
observations (for an unhedged institutional investor). We shall see that the Gaussian distribution, while being, overall, not such a bad approximation of the underlying return distribution, is not so for the extreme, say 1-2%, tails^22.
When studying stock returns, we observe extreme events, mainly negative, of the order of µ minus 5σ and more, with a frequency which is incompatible with the probability of such or more negative events under the hypothesis of Gaussianity. In these evaluations µ and σ are estimated using a long record of data.
While quite rare (do not be fooled by the fact that extreme events always make the
news and so become memorable) such extreme events are much more frequent than
should be compatible with a Gaussian calibrated on the expected value and variance of
observed data. For instance, the probability of a µ − 5σ or more negative observation
in a Gaussian is less than 0.00000028.
Let us consider an example based on a long series of I.B.M. daily returns.
Between Jan 2nd 1962 and Dec 29th 2005 the IBM daily return shows a standard deviation of 0.0164^23. In this time period the return was below −5σ 14 times (assuming a µ of 0; using the historical mean the number is even bigger). The number of observations is 11013, so the observed frequency of a −5σ event is 0.00127, that is: more than 4500 times the probability of such observations for a Gaussian with the same standard deviation!
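These numbers can be reproduced with a few lines of Python (only the counts 14 and 11013 come from the text; the Gaussian tail probability is computed from the complementary error function):

```python
import math

# Probability of a -5 sigma (or worse) daily return under Gaussianity:
# Phi(-5) = erfc(5 / sqrt(2)) / 2
p_gauss = math.erfc(5 / math.sqrt(2)) / 2
print(p_gauss)         # about 2.9e-7 (the 0.00000028 in the text, rounded)

# Observed frequency in the I.B.M. sample quoted in the text:
freq = 14 / 11013
print(freq)            # about 0.00127
print(freq / p_gauss)  # thousands of times the Gaussian probability
```

The exact ratio depends on the rounding of Φ(−5), but the order of magnitude of the discrepancy is unmistakable.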
This is true for a very “mature” and “conservative” stock the like of I.B.M.
Obviously, a frequency of 0.00127 is very small, but the events on which it is computed (big crashes) are those which are remembered in the history of the market. It
is quite clear that, in this case, a Gaussian distribution hypothesis could imply a gross
22 The Gaussian distribution can be a good approximation of many different unimodal distributions if we are interested (as is true in many applications of Statistics) in the behaviour of a random variable near its median. For modeling extreme events, having to do with system failures, breakdowns, crises and similar phenomena, a totally different kind of distribution may be required.
23 (Data are in the Excel file “Exercise 2 - IBM random walk”.)
underestimation of the probability of such events.
The behaviour of the empirical distribution of returns can be summarized in the
motto: fat tails, thin shoulders, tall head. In other words, given a set of (typically daily)
returns over a long enough time period (we need to estimate tails and this requires lots
of data) we can plot the histogram of our data on the density of a Gaussian distribution
with the same mean and standard deviation. What we observe is that, while overall
the Gaussian interpolation of the histogram is good, if we zoom in on the extreme tails, the first and last two percent of data, we see that the tails of the histogram decrease at a slower rate than those of the Gaussian distribution. Moreover, toward the center of the distribution, we see how the “shoulders” of the histogram are thinner than those of the Gaussian and, correspondingly, the histogram is more peaked around the mean.
The following plots are from the excel file “Exercise 4 - Non Gaussian returns”, in
this worksheet we use data from May 19th 1995 to Sep 28th 2005 on the same I.B.M.
series as before.
The first plot compares the interpolated histogram of empirical data with a Gaussian density with the same mean and variance as the data. You can clearly see the mentioned “fat tails, thin shoulders”. Since tails, fat or not, are tails (that is: they are thin), in the second plot we focus on the extreme left tail; at this scale the difference between the empirical and the Gaussian distribution is evident. The x axis is scaled in terms of standard deviation units (1 means 1 standard deviation) and we see that, moving to the left starting at, roughly, 2, the empirical tail is above the Gaussian tail: extreme observations are more frequent than what we would expect in a Gaussian distribution with the same mean and variance as the data.
[Figure: Empirical vs Gaussian density, I.B.M. data. Relative frequency histogram plotted against the Gaussian density with the same mean and variance.]
59
-6
-5
-4
-3
-2
-1
Left tail empirical VS gaussian CDF. I.B.M. data
0
0,02
0 04
0,04
0,06
0,08
0,1
0,12
0
Gaussian CDF
Empirical CDF
Another way to compare the empirical distribution with a Gaussian model (or any model you may choose) is the Quantile-Quantile (QQ) plot. In the worksheet you find the standardized version of the plot. In order to build a standardized QQ plot from data you must first choose a comparison distribution, in our case the Gaussian. The second step is that of standardizing the data, using some estimate of the data expected value and variance. The standardized dataset is then sorted in increasing order and the observations in this dataset shall be the X coordinates in the plot. For each observation of the standardized returns dataset, compute the relative frequency with which smaller than or equal values were observed. Then compute, using some software version of the standard Gaussian CDF tables, the value of the standard Gaussian which leaves on its left exactly the same probability as that left on its left by the X data: this shall be the corresponding Y coordinate in the plot.
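The construction just described can be sketched in Python; the sample below is simulated, and the plotting-position offset (i + 0.5)/n is an implementation detail added here to keep the largest observation away from probability 1:

```python
import random
from statistics import NormalDist, mean, pstdev

random.seed(0)
nd = NormalDist()

# Simulated returns standing in for the I.B.M. series of the worksheet.
returns = [random.gauss(0.0003, 0.016) for _ in range(1000)]

# Steps 1-2: standardize with estimated mean and standard deviation,
# then sort: these are the X coordinates of the QQ plot.
m, s = mean(returns), pstdev(returns)
x = sorted((r - m) / s for r in returns)

# Steps 3-4: for each sorted observation take its empirical CDF level and
# map it through the inverse standard Gaussian CDF: the Y coordinates.
n = len(x)
y = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]

# For Gaussian data the pairs (x, y) hug the bisecting line y = x;
# the detrended version would plot x[i] - y[i] against x[i].
print(round(x[0], 3), round(y[0], 3))  # leftmost point of the plot
```

With real fat-tailed returns the leftmost x would sit well below the corresponding y, which is exactly the departure discussed next.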
[Figure: Quantile-Quantile plot, I.B.M. data. X axis: standardized sorted returns; Y axis: standard Gaussian equivalent observations.]
In the end what you see is a curve of coordinates X, Y. If the curve is the bisecting straight line, your empirical CDF is approximated well by a Gaussian CDF. Departures from the bisecting line are hints of possible non Gaussianity. To facilitate the reading of the plot, a bisecting line is added to the picture. In a second, equivalent, version of the plot the X coordinate is the same but on the Y axis we plot the difference between the Y as computed for the previous plot and the bisecting line. This is called a “detrended” QQ plot.
For the I.B.M. data we see how, on the left tail, observed data are above the diagonal, meaning a left tail heavier than the Gaussian. On the opposite side of the plot we see how the QQ plot lies below the bisecting line. Again: this means that we are observing data far from the mean, and on the right side of it, with a higher frequency than compatible with the Gaussian hypothesis.
Since the data are standardized, the scale of the plot is in terms of number of
standard deviations. We see that, on the left tail, we even observe data near and
beyond -6 times the standard deviation. The tail from minus infinity to -6 times the
standard deviation contains a probability of the order of 5 divided by one billion for the
standard Gaussian distribution. We also observe 10 data points on the leftmost −5σ
tail. Since our dataset is based on 10 years of data, roughly 2600 observations, if we
read our data as the result on independent extractions from the same Gaussian, these
observations, while possible, are by no means expected as the probability of observing
10 times, in 2600 independent draws, something which has in each draw a probability
of 0.00000028 to be observed is virtually 024 .
We can also follow a different, strongly related, line of thought. We see that in this dataset made of about 2600 daily observations we observe an extreme negative return of around −8σ. This is the most negative return, hence the minimum observed value. Now let us ask the following question: what is the probability of observing such a negative minimum if data come from a Gaussian? Suppose data are iid and distributed according to a (standardized) Gaussian. In this case the probability of observing data below the minus 8 sigma level is Φ(−8), and this for each of the 2600 observations. However the probability of observing AT LEAST one value less than or equal to this is 1 minus the probability of never observing such a value, that is, due to iid: $1-(1-\Phi(-8))^{2600}$. It is clear that $1-\Phi(-8)$ is almost 1, but $(1-\Phi(-8))^{2600}$ is smaller. Is it small enough to make $1-(1-\Phi(-8))^{2600}$ big enough so that a minimum value of −8σ
24 To understand this use the binomial distribution. Question: suppose the probability of observing a −5σ in each of 2600 independent “draws” is 0.00000028. What is the probability of observing 10 such events? The answer, computed with Excel, is: $\binom{2600}{10}\,0.00000028^{10}(1-0.00000028)^{2590} = 0.0000\dots$ Meaning that, at the precision level of Excel, we have a 0! While the exact number is not 0, this means that, at least in Excel, the actual rounding error could be quite bigger than the result. For all purposes the answer is 0. Question: in this section we evaluated the “un-likelihood” of −5σ results in two different ways: first with a ratio between frequency and Gaussian based probability, then using the binomial distribution and, again, the Gaussian based probability. What is the connection between these two, different, arguments?
over 2600 iid observations from a standard Gaussian not be termed “anomalous”? The computation is not so simple, as Φ(−8) is a VERY small number and the precision of the Excel routine for its computation cannot be guaranteed. However, using Excel we get $(1-\Phi(-8))^{2600} = .999999999998268$, so that, even taking into account the 2600 observations, the probability of observing as minimum of the sample a −8σ data point is still not really different from 0. I checked the result using Matlab (whose numerical routines should be more precise than Excel’s) getting a very similar result.
In order to get 1 − (1 − Φ(−8))n ) in the range of .01 (still very unlikely) we would
need n =15,000,000,000,000. These are open market days and would correspond to
roughly to 59 billions of years. This is a time period roughly 4 times the current
estimate of the age of our universe. (Again: beware of roundings!). In any sense a −8σ
value is quite unlikely if data come from a standard Gaussian.
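The two computations above (the probability of such a minimum and the sample size needed to make it plausible) can be checked in a rounding-safe way with standard library functions; a minimal sketch, not the Excel/Matlab computation used in the text:

```python
import math

# P(Z <= -8) for a standard Gaussian, via the complementary error function
p = 0.5 * math.erfc(8 / math.sqrt(2))           # about 6.2e-16

n = 2600
# P(minimum of n iid draws <= -8) = 1 - (1 - p)^n; log1p/expm1 avoid the
# catastrophic cancellation that plagues the naive formula at this scale
prob_min = -math.expm1(n * math.log1p(-p))      # about 1.6e-12

# sample size needed to push this probability up to 1%
n_needed = math.log(1 - 0.01) / math.log1p(-p)  # about 1.6e13 market days
```

With these figures `prob_min` is of the order of 10^(−12) and `n_needed` of the order of 15,000 billion days, consistent with the "four times the age of the universe" remark.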
It should be noticed, as a comparison, that for a T distribution with, say, 3 degrees of freedom the probability of never observing a return of −8σ over 2600 days is only 0.347165227, so that the observed minimum (or a still smaller value) has a probability of 0.652834773, that is: by no means unlikely. (In doing this computation recall that the Student's T distribution variance is ν/(ν − 2), where ν is the number of degrees of freedom, so that the quantile corresponding to −8 in a standard Gaussian is, now, −8·√(ν/(ν − 2)).)
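The T computation can be reproduced with the closed-form CDF of the Student's T with 3 degrees of freedom (one of the few cases with an elementary expression); this is a sketch with our own function names, and it should agree with the figures above up to rounding:

```python
import math

def t3_cdf(t):
    """CDF of a Student's T with 3 degrees of freedom (closed form)."""
    x = t / math.sqrt(3.0)
    return 0.5 + (math.atan(x) + x / (1.0 + x * x)) / math.pi

nu = 3
# the T with nu = 3 has variance nu/(nu - 2) = 3, so the point corresponding
# to -8 standard deviations is -8 * sqrt(3)
q = -8.0 * math.sqrt(nu / (nu - 2))
p = t3_cdf(q)                  # about 4e-4 per day

never = (1.0 - p) ** 2600      # probability of never seeing such a return
at_least_once = 1.0 - never    # about 0.65: by no means unlikely
```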
As you can see, while at first sight similar to the Gaussian, the T distribution is VERY different from the Gaussian when tail behaviour is what interests us.
In the following section we shall consider the relevance of these empirical facts from
the point of view of VaR estimation.
Examples
Exercise 4 - Non normal returns.xls Exercise 4b - Non normal returns.xls
5 Four different ways for computing the VaR

First it is necessary to define the VaR25.
Suppose a sum W is invested in a portfolio at time t0 and we are interested in the
p&l (profit and losses) between t0 and t1 that is: Wt1 − Wt0 . In all the (for us) relevant
cases, that is when some “risk” is involved, this p&l shall be stochastic due to the fact
that Wt1 as seen at t0 is stochastic. Our purpose is to give a simple summary of such
stochastic behaviour of the p&l aimed at quantifying our “risk” in a possibly immediate
way.
25
For this section I refer to the worksheets: exercise 4, 4b and 5.
Many such measures can be (and have been) suggested. The RiskMetrics procedure
chose as its basis the so called VaR “Value at Risk”.
Given a level α (usually very small: 1% to 5% as a rule) of probability, the VaR is
defined as the α-quantile of the distribution of Wt1 − Wt0 .
The definition of the α-quantile xα for the distribution of a random variable X is easy to write down and understand when the distribution of X, FX(x), is continuous and strictly increasing at least in an interval (xl, xu) such that FX(xl) < α < FX(xu).

In this case we simply have

xα ≡ x : P(X ≤ x) = FX(x) = α

and

xα = FX^(−1)(α)

where the inverse of FX(x), that is FX^(−1)(α), is defined in a unique way, continuous and strictly increasing at least for FX(xl) < α < FX(xu).

In this case the α-quantile is nothing but the value of X which corresponds to a cumulated probability exactly equal to α, and we indicate such value with xα.

In the case of a cumulative distribution function with jumps (corresponding to probability masses concentrated in specific values of x) there may be no x such that FX(x) = α for a given α.

In this case the convention we use here is that of setting xα equal to the maximum of the values x of X such that x is of positive probability and FX(x) ≤ α.
Barring this possibility, it is correct to say that the VaR at level α, for a time horizon between t0 and t1 of your investment, is that profit and loss such that the probability of observing a worse one is equal to α.
This definition seems to imply that we are required to directly compute a quantile
of the p&l. This is not the case.
In fact what is required is a quantile of the return distribution.
Indeed we have

Wt1 − Wt0 = Wt0 · rt0,t1

Wt1 − Wt0 = Wt0 · (e^(r*t0,t1) − 1)

where rt0,t1 and r*t0,t1 are, respectively, the linear and the log return in the time interval from t0 to t1.

Since the functions return → p&l are both continuous and strictly increasing, the problem of finding the required quantile of the p&l is equivalent to the problem of finding the same quantile in the distribution of returns and transforming it back to the p&l.
In this section we shall consider four different estimates of the VaR which rely on different sets of hypotheses. Each estimate shall be presented in a very simple form; the reader is warned that the actual implementation of any of these estimates involves many further details.
5.1 Gaussian VaR

5.1.1 Point estimate of the Gaussian VaR
A word of notice: in what follows we shall use R as the symbol meaning the random variable "return" and r as a possible value of such random variable. To avoid heavy notation we shall not indicate the kind of return we are speaking about or the time interval considered; both pieces of information shall be clear from the context.
Gaussian VaR is the most restrictive setting used in practice. We suppose that R is distributed according to a Gaussian density with expected value and variance (µ, σ²) which are either known or estimated in such a way as to minimize sampling error problems. A typical attitude is that of setting µ = 0 and estimating σ², for instance, using the smoothed estimate described above26.
The important point to remember with the Gaussian density is that, under this
hypothesis, knowledge of mean and variance is equivalent to the knowledge of any
quantile.
Under the Gaussian Hypothesis, the CDF is continuous so we can find a quantile
with exactly α probability on its left for any α.
The procedure is simple: we must find rα such that:
P (R ≤ rα ) = α
And proceeding with the usual argument, already well known from confidence intervals
theory, we get:
P (R ≤ rα ) = α = P ((R − µ)/σ ≤ (rα − µ)/σ) = Φ((rα − µ)/σ) = Φ(zα )
Where zα is the usual α quantile for the standard Gaussian CDF Φ(.).
We have, then
(rα − µ)/σ = zα
rα = µ + σzα
This is quite easy. The problem is that, for small values of α we are considering
quantiles very far on the left tail and our previous empirical analysis has shown how
the Gaussian hypothesis for returns (overall not so bad) is inadequate for extreme tails.
Typically the problem of fat tails shall imply an undervaluation of the VaR.
26 This hypothesis is not reasonable for linear returns, which are bounded below. It is however sometimes used in this case too.
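The point estimate rα = µ + σzα translates directly into code; a minimal sketch using the standard library inverse Gaussian CDF (the σ value anticipates the numerical example given below):

```python
from statistics import NormalDist

def gaussian_var_quantile(mu, sigma, alpha):
    """alpha-quantile of a Gaussian return: r_alpha = mu + sigma * z_alpha."""
    z_alpha = NormalDist().inv_cdf(alpha)   # e.g. about -1.96 for alpha = 0.025
    return mu + sigma * z_alpha

# RiskMetrics-style choice: mu = 0, sigma estimated separately
r_alpha = gaussian_var_quantile(mu=0.0, sigma=0.0164, alpha=0.025)
# r_alpha is about -0.0321, i.e. a daily log return of about -3.21%
```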
5.1.2 Approximate confidence interval for the VaR
Now a problem: we do not know µ and σ. We must estimate them. The usual RiskMetrics procedure sets µ = 0 and estimates σ with the smoothed estimate introduced above. The estimate of the quantile of the return distribution, that is rα = σzα, shall then be r̂α = σ̂zα.

According to sound statistical practice we should complement this with a measure of sampling variability.

Here we show a possible approximate and simple way to do so, by computing a lower confidence bound.
Under the assumptions of uncorrelated observations with constant variance and zero expected value, it is easy to compute the variance of r̂α² = σ̂²zα².

In fact we have

V(r̂α²) = zα⁴ V(σ̂²)

During the discussion about the different precision in estimating E(r) = µ and V(r) = σ² we derived, for the Gaussian zero-µ case, the formula

V(σ̂²) = [ Σ_{i=0,...,n} λ^(2i) / ( Σ_{i=0,...,n} λ^i )² ] 2σ⁴

Using the approximation

Σ_{i=0,...,n} λ^(2i) / ( Σ_{i=0,...,n} λ^i )² ≈ (1 − λ)/(1 + λ)

this becomes

V(σ̂²) = [(1 − λ)/(1 + λ)] 2σ⁴

so that

V(r̂α²) = zα⁴ [(1 − λ)/(1 + λ)] 2σ⁴
We can then estimate the σ⁴ term by taking the square of the estimate of the variance and get

V̂(r̂α²) = zα⁴ [(1 − λ)/(1 + λ)] 2σ̂⁴

A possible approximate lower confidence bound for the quantile, with the usual "two sigma" rule, is then given by (minus) the square root of the upper extreme of a two-sigma one-sided interval for r̂α².
r̂α² + 2√(V̂(r̂α²)) = r̂α² + 2σ̂²zα²√(2(1 − λ)/(1 + λ)) = r̂α² [1 + 2√(2(1 − λ)/(1 + λ))]
is an (upper) confidence bound for the square of the quantile estimate. In order to convert it into a (lower) bound for the quantile estimate we simply take

r̂α √(1 + 2√(2(1 − λ)/(1 + λ)))

(recall that r̂α is negative, so multiplying it by a factor bigger than 1 makes it more negative).
Now let us see some numbers.

Suppose an estimate of the (daily) σ as the one obtained above from I.B.M. data between 1962 and 2005: 0.0164. You are computing a daily Gaussian VaR with α = 0.025. In this case z0.025 = −1.96.

This gives a quantile point estimate equal to

r̂α = σ̂zα = 0.0164 · (−1.96) = −0.0321

Let us assume that our variance estimate comes from a typical implementation of the smoothed estimate formula with daily data, n = 256 (meaning roughly one year of data) and λ = 0.95.
In this case we have √(1 + 2√(2(1 − λ)/(1 + λ))) = 1.2053 and the bound shall be, roughly, 20% more negative than the point estimate of the quantile, that is

r̂α √(1 + 2√(2(1 − λ)/(1 + λ))) = −0.0321 · 1.2053 = −.0387

Notice that √(1 + 2√(2(1 − λ)/(1 + λ))) = 1.2053 only depends on the choice of λ, so that it can
be precomputed for any estimate sharing the same choice of λ.
What we found is a confidence bound for the quantile of the (log) return.
In order to transform this into a confidence bound for the VaR we need to know
the amount invested W at time t0 .
The bound to the VaR shall be W ∗ (e−.0387 − 1) = W ∗ (−.0380), that is a loss of
3.8% (and here the use of % is correct, why?)
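The arithmetic of this example can be sketched as follows (same σ̂, zα and λ as above; the variable names are ours):

```python
import math

sigma_hat, z_alpha, lam = 0.0164, -1.96, 0.95

r_hat = sigma_hat * z_alpha      # point estimate of the quantile, about -0.0321

# widening factor sqrt(1 + 2*sqrt(2(1 - lam)/(1 + lam))) for the smoothed estimate
factor = math.sqrt(1 + 2 * math.sqrt(2 * (1 - lam) / (1 + lam)))
r_bound = r_hat * factor         # about -0.0387 (more negative, since r_hat < 0)

# lower bound for the p&l per unit invested, converting the log return
pl_bound = math.expm1(r_bound)   # about -0.0380: a 3.8% loss
```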
In order to further understand the consequences of using the smoothed estimate, consider the case of the "classic" estimate with λ = 1 in

V(σ̂²) = [ Σ_{i=0,...,n} λ^(2i) / ( Σ_{i=0,...,n} λ^i )² ] 2σ⁴
so that

V(σ̂²) = 2σ⁴/(n + 1)

and the bound shall be

r̂α √(1 + 2√(2/(n + 1)))
and, due to n, we quickly have that the extreme of the interval becomes almost identical to the point estimate. For n = 256 we already get r̂α √(1 + 2√(2/257)) = r̂α · 1.084 = −0.0348.
5.1.3 An exact interval under stronger hypotheses (not for the exam)
For those interested in "exact" confidence intervals, we can derive a more formally strong result using the following theorem:

Theorem 5.1. If {X1, X2, ..., Xn} are iid Gaussian random variables with expected value µ and standard deviation σ, then

S² = Σ_{i=1}^{n} ((Xi − µ)/σ)²

is distributed according to a Chi square distribution with n degrees of freedom.
This implies that, if we estimate the variance with the non-smoothed sample variance (with µ = 0):

V̂(r) = Σ_{i=1}^{n} ri²/n

we have that

n V̂(r)/σ² = Σ_{i=1}^{n} ri²/σ² ∼ χ²_n
so that

P( n V̂(r)/σ² ≥ χ²_{n,β} ) = P( n V̂(r)/χ²_{n,β} ≥ σ² ) = 1 − β

where χ²_{n,β} is the β quantile of the χ²_n distribution; that is, n V̂(r)/χ²_{n,β} is a 1 − β upper confidence bound for σ².

From this we see that a 1 − β confidence lower extreme for the α quantile is given by

L r̂α = zα √( n V̂(r)/χ²_{n,β} )
With the same numbers we used above, and using β = .025, so that χ²_{n,β} = χ²_{256,.025} = 213.5747, and α = .025, this becomes

L r̂α = √(256 · 0.0164² / 213.5747) · (−1.96) = −.0352

against a point estimate of .0321 (here we drop the minus sign).
Since this lower bound estimate is based on the non-smoothed estimate of the variance, it can be compared with the corresponding approximate bound found in the previous section.

That bound was −0.0348, which is very similar to the one derived here in a rather different way.
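A sketch of this exact bound; the χ² quantile is hardcoded from the text (the standard library has no χ² inverse CDF), and the variable names are ours:

```python
import math

n, sigma_hat, z_alpha = 256, 0.0164, -1.96
chi2_q = 213.5747                    # chi-square quantile used in the text (256 df)

v_hat = sigma_hat ** 2               # non-smoothed variance estimate with mu = 0
sigma2_upper = n * v_hat / chi2_q    # upper confidence bound for sigma^2
lower_bound = z_alpha * math.sqrt(sigma2_upper)   # about -0.0352
```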
5.2 Non parametric VaR

5.2.1 Point estimate
The non parametric VaR estimate stands, in some sense, at the opposite of the Gaussian
VaR. In the non parametric case we suppose only that returns are i.i.d. but we avoid
assuming anything about the underlying distribution.
However, in order to find the VaR we need an estimate of the unknown theoretical
distribution.
In standard parametric settings, where we assume, e.g., normality, this is done by
estimating parameters and then computing the required probabilities and finding the
required quantiles using the parametric model with estimated parameters. Since we
now are making no specific assumption about the return distribution we need to find
an estimate of it which is “good” whatever the unknown distribution be.
The starting point of all non parametric procedures is to estimate the theoretical distribution using the empirical distribution function. Suppose we have a sample of n i.i.d. returns with common distribution F(.) which yield observed values {r1, r2, ..., rn}; then our estimate of F(.) shall be:

P̂(R ≤ r) = F̂R(r) = #{ri ≤ r}/n = Σ_{i=1}^{n} I(ri ≤ r)/n

where #{ri ≤ r} means the number of observed returns less than or equal to r, and I(ri ≤ r) is a function which is equal to 1 if ri ≤ r and 0 otherwise.
Under our hypothesis of i.i.d. returns with unknown distribution F(.), the above defined estimate works quite well in the sense that

E( Σ_{i=1}^{n} I(ri ≤ r)/n ) = (n/n) E(I(ri ≤ r)) = P(ri ≤ r) = F(r)

and

V( Σ_{i=1}^{n} I(ri ≤ r)/n ) = (n/n²) V(I(ri ≤ r)) = F(r)(1 − F(r))/n

where the last passage depends on the fact that, for given r, I(ri ≤ r) is a Bernoulli random variable with P = F(r).
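The empirical CDF and the standard error implied by the variance formula above have a direct translation; a minimal sketch with made-up data:

```python
def ecdf(returns, r):
    """Empirical CDF: fraction of observed returns less than or equal to r."""
    return sum(1 for ri in returns if ri <= r) / len(returns)

def ecdf_std_error(f_hat, n):
    """Standard error of the estimate, from V = F(r)(1 - F(r))/n."""
    return (f_hat * (1 - f_hat) / n) ** 0.5

sample = [-0.021, 0.004, -0.007, 0.013, -0.032, 0.009, -0.001, 0.017]
f_hat = ecdf(sample, -0.007)     # 3 of 8 observations are <= -0.007, so 0.375
```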
Given this estimate of F, the non parametric VaR is, in principle, very easy to compute.

Order the observed ri in an increasing way, then define r̂α as the smallest ri such that the observed frequency of data less than or equal to it is α, if such an ri exists. This ri does not exist if α is not one of the observed values of cumulative frequencies, that is: if there exists no ri such that F̂R(ri) = α. In this case we make an exception with respect to the common definition of empirical quantile and define r̂α as the biggest observed ri such that F̂R(ri) < α.

(Linear or other interpolations between consecutive observations are frequently used, but we shall consider this in the "semi parametric VaR" section.) This is nothing but a possible definition for the inversion of the empirical CDF27.
The problem with this estimate is that, if α is small, we are considering areas of the
tail where, probably, we made very few observations. In this case the estimate could
be quite unstable and unreliable. The reader should compare this estimate with the
estimate of a quantile in the Gaussian case. In the Gaussian case we estimate quantiles by inverting the CDF which, in turn, is estimated indirectly, by estimating µ and σ,
the unknown parameters. This implies that any data point tells us something about
any point of the distribution (maybe very far from the observed point) as it contributes
to the estimate of both parameters. In other terms, a parametric hypothesis allows
us to estimate the shape of the distribution in regions where we do not make any
observations. Instead, in the non parametric case, each data point, in some sense, has
only a “local” story to tell. To be more precise: the non parametric estimate of the
CDF at a given point r does not change if we change in any way the values of our data
provided we keep constant the number of observations smaller than and greater than
r.
So, we use very little information from the data in a non parametric estimate while
the influence of any data point on a parametric estimate is big. An unwritten law of
Statistics is that, if you use little information you are going to get an estimate which is
robust to many possible hypotheses of the data distribution but with a high sampling
variability; on the other hand, if you use a lot of information in your data, as you do in
a parametric model, you are going to have an estimate which is not robust but with a
smaller sampling variability. This is what happens in the case of non parametric VaR when compared to, say, Gaussian VaR.

27 Our choice does not correspond to some definitions of empirical quantile you may find in Statistics books. In particular, in the case where no ri exists such that F̂R(ri) = α, the empirical quantile is sometimes defined as the smallest observed ri such that F̂R(ri) > α. This would not be proper for our purpose, which is to estimate the size of a possible loss and, if needed, exaggerate it on the safe side.
5.2.2 Confidence interval for the non parametric quantile estimate
Let us study a little bit these properties by computing a one-sided confidence interval for the α quantile rα on the basis of a simple random sample of size n.

Suppose we order our data from the smallest to the biggest value (that is: we compute the "order statistics" of the sample). Call the j-th order statistic r(j).

Using the "integer part" notation, where [c] is the largest integer smaller than or equal to c, the above defined estimate of rα can be simply written as r̂α = r([nα]).

This means that our estimate is the nα-th ordered observation if nα is an integer, or the ordered observation corresponding to the largest integer smaller than nα.

This is a sensible choice but, as usual, due to sampling variability, the estimate could be either more or less negative than the "true" (and unobservable) rα.

We would be quite worried if rα ≤ r([nα]) (the = is here for cautiousness' sake). A possible strategy in order to lower the probability of this event is that of building a lower bound for the estimate based on an r(j) with j < [nα].
In order to choose this j we must answer the following question: what is the probability that the "true" α quantile, say rα, is smaller than or equal to any given j-th order statistic (ordered observation) r(j) in the sample? If this event happens and we use this order statistic as estimate, the estimate shall be wrong in the sense that we shall undervalue the possible loss (as noted above, the "equal" part is put in as an added guarantee).

This error is going to happen if, by chance, we make at most j observations smaller than, or equal to, rα. In fact when, for instance, the number of observations less than rα is, say, j − 1, the j-th empirical quantile shall be bigger (less negative) than rα (or equal to it, see the previous sentence).

Since observations are iid and, supposing a continuous underlying distribution, the probability of observing a return less than or equal to rα is, by definition, α, the probability of making exactly i observations less than (or equal to) rα (and so n − i bigger than rα) is

C(n, i) α^i (1 − α)^(n−i)
We then have that the probability of making at most j observations less than (or equal to) rα, that is, the probability that r(j) be greater than or equal to rα, is equal to the sum of the probabilities of observing exactly i returns smaller than or equal to rα for i = 0, 1, 2, ..., j. For i = 0 all observations are greater than rα; for i = 1 only the smallest observation is smaller than or equal to rα, and so on up to i = j, where we have exactly j observations smaller than or equal to rα (we are including the case r(j) = rα because we want to be on the "safe side" and avoid a possible undervaluation of the risk). Obviously, from i = j + 1 onward, we have j + 1 or more observations smaller than or equal to rα, so that r(j) shall be, supposing the probability of "ties" (identical observations) equal to 0 as in the case of a continuous F, strictly smaller than rα.

In the end, the probability of "making a mistake" in the sense of undervaluing the possible loss, that is the probability of choosing an empirical quantile r(j) greater than rα, is given by:

P(r(j) ≥ rα) = Σ_{i=0}^{j} C(n, i) α^i (1 − α)^(n−i)
Now the confidence limit: to be conservative, we want to estimate rα with an empirical quantile r(j) such that we have a small probability β that the true quantile rα is smaller than its estimate. This, again, is because we are willing to overstate and not to understate risk; hence, we "prefer" to choose an estimate more negative than rα rather than a less negative one. Obviously, we would also like not to exaggerate on the safe side.

Our strategy shall be as follows: we choose an r(j) such that P(r(j) ≥ rα) ≤ β for a given β, which represents with its size how much we are willing to accept an underestimation of the risk (the smaller the β, the more adverse we are to underestimating rα). On the other hand, we do not want j to be smaller (that is, r(j) more negative) than required. Summarizing, we must solve the problem

max(j) : P(r(j) ≥ rα) ≤ β
This is going to be the extreme of our one-tail confidence interval.

Notice this: the expected value of the random variable i with probability function C(n, i) α^i (1 − α)^(n−i) is equal to nα, so that we could "expect" the empirical quantile corresponding to the index j just smaller than or equal to nα to be the "right choice" (and exactly this choice of point estimate was made in the previous paragraph). E.g., if α = .01 and n = 2000, intuitively we could use as an estimate of rα the empirical quantile r(20).

However, if we make this choice, we are going (for n and α not too small) to have roughly fifty-fifty probability that the true quantile is on the left or on the right of the estimate. This is due to the central limit theorem, according to which

Σ_{i=0}^{j} C(n, i) α^i (1 − α)^(n−i) ≈ Φ_{nα; nα(1−α)}(j)

If the approximation works for our n and α, we see that nα becomes the mean of (almost) a Gaussian, hence the probability on its right and on its left becomes .5.
For reasons of prudence, fifty-fifty is not good for us: we go for a smaller probability that the chosen quantile be bigger than rα, that is, for a β smaller than .5.

For this reason we choose an empirical quantile corresponding to a j smaller than the index just below (or equal to) nα, and we do this according to the above rule.

The just quoted central limit theorem, if n is big and α not too small, simplifies our computations with the following approximation:

P(r(j) ≥ rα) = Σ_{i=0}^{j} C(n, i) α^i (1 − α)^(n−i) ≈ Φ_{nα; nα(1−α)}(j) = Φ_{0;1}( (j − nα)/√(nα(1 − α)) )

With this approximation, we want to solve

max(j) : Φ_{0;1}( (j − nα)/√(nα(1 − α)) ) ≤ β

so that our solution is given by the biggest (integer) j such that (j − nα)/√(nα(1 − α)) ≤ zβ or, what is the same, the biggest (integer) j such that j ≤ nα + √(nα(1 − α)) zβ.

Using the more compact "integer part" notation and calling r̂α,β our lower bound, we have:

r̂α,β = r([nα + √(nα(1−α)) zβ])

Notice that [nα + √(nα(1 − α)) zβ] does not depend on the observed data but on α, β, n only. Hence the solution in terms of j, that is: which ordered observation to use (obviously NOT in terms of r(j)), is known before sampling.
Suppose, for instance, you have 1000 observations and look for the 2% VaR. The most obvious empirical estimate of the 2% quantile is the 20th ordered observation but, according to the central limit theorem, the probability that the true 2% quantile is on its left (as on its right, but this is not important for us) is 50%.

To be conservative you wish for a quantile which has only 2.5% probability of being on the right of the 2% quantile. Hence you choose a β of 2.5% (zβ = −1.96) and you get

nα + zβ √(nα(1 − α)) = 1000 · .02 − 1.96 · √(1000 · .02 · .98) = 11.32

According to this result, your choice for the lower (97.5%) confidence bound for the (2%) VaR is given by the [11.32] = 11-th ordered observation, that is, roughly, the 1% empirical quantile.
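The rule above depends only on n, α and β, so the index can be computed before seeing any data; a sketch reproducing this example (function name is ours):

```python
import math
from statistics import NormalDist

def conservative_quantile_index(n, alpha, beta):
    """Index j of the order statistic giving the prudential bound:
    j = [n*alpha + sqrt(n*alpha*(1 - alpha)) * z_beta] (normal approximation)."""
    z_beta = NormalDist().inv_cdf(beta)   # negative for beta < 0.5
    return math.floor(n * alpha + math.sqrt(n * alpha * (1 - alpha)) * z_beta)

# 1000 observations, 2% VaR, beta = 2.5%
j = conservative_quantile_index(n=1000, alpha=0.02, beta=0.025)
# j = 11: use the 11th ordered (most negative) observation instead of the 20th
```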
Beware: do not mistake α for β. The first defines the quantile you want to estimate
(rα ) and the second the confidence level of the confidence interval.
Is this prudential estimate much different w.r.t. the simple “expected” quantile?
It depends on the distance between ranked observations on the tail for observed cumulative frequencies of value about α. If the tail goes down quickly the distance is
small and the difference between, in this case, the 11th and the 20th quantile shall not
be big. On the contrary, with heavy tails the difference between the 1% and the 2%
empirical quantile can be quite big.
As an example consider the case of the I.B.M. data between May 19th 1995 to Sep
28th 2005 discussed above.
The point estimates are: the 2.5% empirical quantile (ranked obs 66th) is −4.105% and the 1% empirical quantile (ranked obs 26th) is −5.57%. These point estimates correspond to 97.5% confidence bounds of −4.67% (obs 50) and −6.53% (obs 16): in the first case roughly .5% more negative than the point estimate, in the second case 1%. The reason for the difference is that around the 1% empirical quantile observations are more "rarefied", hence with large intervals in between, than around the 2.5% empirical quantile.
With the same data, a Gaussian VaR estimate, using for comparison the full sample standard deviation as estimate of σ (value 0.021421), gives, for the 2.5% VaR, a point estimate of −4.14%, to be compared with the 2.5% empirical quantile of −4.105% (confidence bound −4.67%). However, in the Gaussian case the (approximate 2σ) lower confidence limit, given the more than 2600 observations and the unsmoothed estimate, is −4.23%: very similar to the point estimate. As we did see a moment ago, this is not true for the empirical VaR (.5% of difference between the estimate and the confidence limit).

Things are worse on more extreme quantiles.

If we compute the 1% quantile in the Gaussian case, we get −4.93% with a (two σ) bound of −5.02%, to be compared with the non parametric −5.57% and the corresponding bound of −6.53%.28
When we are evaluating extreme quantiles, two "negative" forces sum. First, the empirical distribution is very "granular" in the tails (very few observations). Second, the empirically observed heavy tails imply the possibility of considerable differences between contiguous quantiles, bigger than expected in the case of Gaussian data.

Non parametric VaR, sometimes dubbed "historical" VaR because it uses the observed history of returns in order to estimate the empirical CDF, is probably the most frequently used in practice. Again, confidence limits are often ignored, and this could be due to their dismally "big" size.

The problem of a big sampling variance for such estimates is very well known. Applied VaR practitioners and academics have suggested, in the last years, an amazing quantity of possible strategies for improving the quality of the non parametric tail estimate. Most of these suggestions fall into two categories: semi parametric modeling of the distribution tails and filtered resampling algorithms.

28 These estimates may change very much if we change the sample. For instance, with a longer stretch of data, between 1962 and 2005, the standard deviation is 0.0164, the 2% Gaussian VaR is −3.37% while the 2% empirical quantile is −3.4%.

In the following subsection we shall consider a simple example of semi parametric model. The resampling approach is left for more advanced courses.
5.3 Semi parametric VaR
Semi parametric VaR mixes a non parametric estimate with a parametric model of the
left tail.
The aim is that of conjugating the robustness of the non parametric approach with
the greater efficiency of a parametric approach.
As we did see in the previous section, the non parametric approach can result in good evaluations of the VaR for values of α not very small. For small α its sampling variability may be non negligible even for big samples. However, in VaR computation we look for a quantile estimate for small α. The idea of a semi parametric approach is that of using a parametric model just for the tail of the distribution beyond a small but not too small α quantile. The plug-in point of the parametric model is estimated in a non parametric way; from that point onward a parametric model is used in order to estimate the required extreme quantile.
The reason why this may work is that, on the basis of arguments akin to the central
limit but considering not means but extreme order statistics, we can prove that, while
we may have many different parametric models for probability distributions, the tails
of such distributions behave in a way that can be approximated with few (typically
three) different parametric models.
This is not the place to introduce the very interesting topic of "extreme value theory", a recent fad in the quantitative Finance milieu, so we shall not be able to fully justify our choice of tail parametric model. Be it sufficient to say that a rigorous justification is possible.
We suppose that, for r negative enough:

P(R ≤ r) = L(r)|r|^(−a)

where, for such negative enough r, L(.) is a slowly varying function for r → −∞ (formally this means lim_{r→−∞} L(λr)/L(r) = 1 ∀λ > 0, and you can understand this as implying that the function L is approximately a constant for big negative values of r) and a is the speed with which the tail of the CDF goes to zero at a polynomial rate.

This is sometimes called a "Pareto" tail because a famous density showing this tail behaviour bears the name of Vilfredo Pareto.
This choice of tail behaviour could be justified on the basis of limit theory, as hinted
at before, or on the basis of good empirical fitting to data.
Notice that the Gaussian CDF has exponential tails, which go to 0 much faster
than polynomial tails. Pareto tails are, thus, a model for “heavy” tails.
Provided we know where to plug in the model (that is: which value of r is negative enough), our first task is that of estimating a, the only parameter in the model. In order to do so we take the logarithm of the previous expression and we get:

log(P(R ≤ r)) = log(L(r)) − a log(|r|)

We then assume that, maybe with an error, log(L(r)) can be approximated by a constant C:

log(P(R ≤ r)) ≈ C − a log(|r|)

This expression begins to be similar to a linear model. In fact, if, in correspondence of any observed ri, we estimate log(P(R ≤ ri)) with log(F̂R(ri)) and summarize the various approximations in an error term ui, we have:

log(F̂R(ri)) = C − a log(|ri|) + ui
A linear regression based on this model shall not work for the full dataset of returns,
but it shall work for a properly chosen subset of extreme negative returns. A simple
way to find the proper subset of observations is that of plotting log(F̂R (ri )) against
log(|ri |) for the left tail of the distribution. Typically this plot shall show a parabolic
region (consistent with the Gaussian hypothesis) followed by a linear region (consistent
with the polynomial hypothesis). The regression shall be run with data from the second
region.
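A sketch of this tail regression on simulated data with an exact Pareto left tail, so that the true index is known (the sample size, the k = 500 cutoff and all names are our illustrative choices, not the text's):

```python
import math
import random

def tail_index_ols(returns, k):
    """OLS fit of log F_hat(r) = C - a*log|r| on the k most negative returns,
    taking F_hat at the i-th order statistic to be i/n; returns the estimate of a."""
    n = len(returns)
    ordered = sorted(returns)                    # most negative first
    xs = [math.log(abs(ordered[i])) for i in range(k)]
    ys = [math.log((i + 1) / n) for i in range(k)]
    mx, my = sum(xs) / k, sum(ys) / k
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return -cov / var                            # the OLS slope is -a

random.seed(12345)
a_true = 3.0
# if U is uniform on (0, 1), then -U^(-1/a) satisfies P(R <= r) = |r|^(-a)
# for r <= -1: a pure Pareto left tail with index a
sample = [-(random.random() ** (-1.0 / a_true)) for _ in range(20000)]
a_hat = tail_index_ols(sample, 500)              # should land near 3
```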
Suppose we now have an estimate for a: how do we get an estimate of the quantile? The problem is that of plugging the parametric tail into the non parametric estimate of the CDF.

The solution is simple if we suppose to have a good non parametric estimate for the quantile rα1, where α1 is too big for this quantile estimate to be used as VaR. What we need is an estimate of rα2 for α2 < α1. If we suppose that the tail model is approximately true for both quantiles, we have:

α1/α2 = [L(rα1)/L(rα2)] · |rα2/rα1|^a

But the ratio L(rα1)/L(rα2) should be very near to 1 (the same slowly varying function computed at not very far away points), so that we can directly solve for rα2:

rα2 = rα1 (α1/α2)^(1/a)

Given the non parametrically estimated rα1, an estimate of a (based on the above described regression) and a chosen α2, we are then able to estimate the quantile rα2.
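The extrapolation step is one line; the numbers below are illustrative assumptions, not estimates from the text's data:

```python
def pareto_tail_quantile(r_alpha1, alpha1, alpha2, a):
    """Extrapolate a more extreme quantile under a Pareto tail:
    r_alpha2 = r_alpha1 * (alpha1/alpha2)^(1/a)."""
    return r_alpha1 * (alpha1 / alpha2) ** (1.0 / a)

# illustrative: 5% quantile of -4%, tail index a = 3, target alpha2 = 1%
r_01 = pareto_tail_quantile(r_alpha1=-0.04, alpha1=0.05, alpha2=0.01, a=3.0)
# r_01 is about -0.0684, more extreme than a Gaussian extrapolation would give
```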
[Figure: log-log plot of the extreme negative (in absolute value) data. The linear hypothesis seems to work to the left of the −3/−4 sigma region.]
5.3.1 Confidence interval for the semi parametric estimate (not for the exam)
We are interested in a lower bound for a. In fact, we see from the above formula that
the bigger is a (which is positive by definition) the faster shall be the decline of the
log CDF and the nearer the quantile. So, our risk is to exaggerate a due to sampling
error.
If we suppose, with some excess of faith, that the model

log(F̂R(ri)) = C − a log(|ri|) + ui

satisfies the hypotheses required for a linear model (which we shall analyze in detail during this course), we have that the best estimate (the OLS estimate) of −a shall be

−â = Cov(log(F̂R(ri)); log(|ri|)) / Var(log(|ri|))

and the sampling variance of this shall be given by

Var(−â) = [Var(ui) / Var(log(|ri|))] (1/n)

and Var(ui), under standard OLS hypotheses, can be estimated on the basis of the errors of fit of the linear model, given by

ûi = log(F̂R(ri)) − Ĉ + â log(|ri|)

using the formula

V̂ar(ui) = Σ_{i=1}^{n} ûi² / (n − 2)

The lower bound of a 1 − β one-sided interval for a shall be given by

âL = â − t_{n−2}^{1−β} √( [V̂ar(ui) / Var(log(|ri|))] (1/n) )
This implies a lower bound for the estimated quantile given by

L rα2 = rα1 (α1/α2)^(1/âL)
A detailed discussion of this method, with suggestions for the choice of the subset of data on which to estimate the tail index and formulas for more sophisticated confidence intervals, may be found in^29.

^29 "Tail Index Estimation, Pareto Quantile Plots, and Regression Diagnostics", Jan Beirlant, Petra Vynckier, Jozef L. Teugels, Journal of the American Statistical Association, Vol. 91, No. 436 (Dec., 1996), pp. 1659-1667. JSTOR link: http://www.jstor.org/stable/2291593. However the reading of this paper is not required for the course.
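Under the same (admittedly heroic) OLS assumptions on the log-log regression, the lower bound âL and the conservative quantile can be sketched as follows (Python with numpy and scipy; names and the tail fraction are our own assumptions):

```python
import numpy as np
from scipy import stats

def tail_quantile_lower(returns, alpha1, alpha2, beta=0.05, tail_frac=0.05):
    """One sided 1-beta lower bound a_L for the tail index and the implied
    conservative quantile r_{alpha2,L} = r_{alpha1} * (alpha1/alpha2)**(1/a_L)."""
    r = np.sort(returns)
    n = len(r)
    k = max(int(tail_frac * n), 10)
    x = np.log(np.abs(r[:k]))                    # log |r_i| on the far left tail
    y = np.log(np.arange(1, k + 1) / n)          # log empirical CDF
    slope, intercept = np.polyfit(x, y, 1)       # fit y = C - a x + u, slope = -a
    a_hat = -slope
    resid = y - (intercept + slope * x)
    var_u = np.sum(resid ** 2) / (k - 2)         # estimated Var(u)
    se = np.sqrt(var_u / (np.var(x) * k))        # sampling std of a_hat
    a_L = a_hat - stats.t.ppf(1 - beta, k - 2) * se
    r_a1 = np.quantile(returns, alpha1)
    return a_hat, a_L, r_a1 * (alpha1 / alpha2) ** (1.0 / a_L)
```

Since a_L < â, the extrapolation factor (α1/α2)^(1/a_L) is larger and the resulting VaR more conservative.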
A comparison of Gaussian, non parametric and semi parametric VaR is shown in
detail in the Excel worksheet Exercise 5 VaR.xls.
5.4 Mixture of Gaussians
An intuitive idea any observer of the market could hold is that trading days are not
all the same. Most days go unnoticed while, for a small number of days, the market
seems to work at a fast rate, time seems to pass faster and volatility is higher. This
common notion is consistent with the fact that relevant information does not arrive to
the market as a continuous, uniform flow but rather in bits and chunks.
A possible very simple formalization of this observation is as follows.
There exist two types of days: 1 and 2. Conventionally we shall label with 1 "standard" days and with 2 "fast" days. Suppose that in both types of day the distribution of returns is Gaussian with the same expected value µ but with different variances: σ1², σ2². We do not know, either a priori or a posteriori, in which type of day we are going, or did, live. Both types of day are compatible with any kind of return even if, obviously, returns far in the tails are more likely in fast days.
In other terms we do not observe data from the two separate distributions of returns but a mixture of data. The density of this mixture shall be given by the mean of the two Gaussian densities with weights P and 1 − P, where P is the probability of being in an ordinary day. If we indicate with N(r; µ, σ²) the Gaussian density with expected value µ and variance σ² we can compute the marginal density for each observation ri as:
f (ri ; µ, σ1 , σ2 , P ) = N (ri ; µ, σ12 )P + N (ri ; µ, σ22 )(1 − P )
(this is a simple application of the standard rule for computing the marginal distribution
given the conditional distribution and the probability of the conditioning events).
Suppose now we have an i.i.d. sample where the density of each observation is the mixture f. In order to estimate the unknown parameters, a sensible procedure is as follows.
First step: estimate µ. Here we exploit the fact that µ is the same in both day types. Hence the expected value of the mixture is, again, µ and we can estimate it using the simple sample average:

µ̂ = Σ_{i=1,...,n} ri / n
We then plug this estimate into f and build the likelihood of our sample r:

ℓr(σ1, σ2, P) = Π_{i=1,...,n} f(ri; µ̂, σ1, σ2, P)
We can then compute the maximum likelihood estimates of the unknown parameters σ1, σ2, P as the values which maximize ℓ (or, typically, its logarithm). Notice that this procedure does not allow for an analytic solution and requires numerical optimization methods.
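A minimal implementation sketch of this two-step procedure (Python with numpy and scipy; the function name, starting values and bounds are our own choices, and this is a sketch, not production code):

```python
import numpy as np
from scipy.optimize import minimize

def norm_pdf(x, m, s):
    """Gaussian density N(x; m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def fit_mixture(r):
    """Two-component Gaussian mixture with common mean: mu is estimated by the
    sample average, (sigma1, sigma2, P) by numerical maximum likelihood."""
    mu = r.mean()                                # first step: plug-in estimate of mu

    def neg_loglik(theta):
        s1, s2, p = theta
        dens = p * norm_pdf(r, mu, s1) + (1 - p) * norm_pdf(r, mu, s2)
        return -np.sum(np.log(dens))             # minus log likelihood

    s0 = r.std()
    res = minimize(neg_loglik, x0=[0.5 * s0, 2.0 * s0, 0.8],
                   bounds=[(1e-8, None), (1e-8, None), (1e-3, 1 - 1e-3)])
    s1, s2, p = res.x
    return mu, s1, s2, p
```

On data simulated from such a mixture the routine recovers one "small" and one "large" volatility, with p near the simulated probability of a standard day.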
This simple mixture model, while rough, can be easily implemented and gives a very good approximation of the return distribution with the exception of the extreme tails. The tails of a Gaussian mixture are still exponential while the extreme tails of the empirical CDF seem to decrease at a slower, maybe polynomial, rate. However the quality of fit is, in general, quite good up to α values even smaller than 1%, that is, values which allow VaR estimates.
This is a very simple example of mixture models. Many variants are possible (and
can be found in real world applications and in academic literature): we can increase the
number of components, we can also use different kinds of distribution as components,
we can model a dependency of P on time and on other observable variables (a sensible choice being past squared returns).
[Figure: Two Gaussians with the same expected value and different standard deviations.]

[Figure: Mixture of two Gaussians, pφ(x; µ, s1) + (1 − p)φ(x; µ, s2), compared with a single Gaussian φ(x; µ, s) with the same mean and variance as the mixture.]
Examples
Exercise 5 - VaR.xls Exercise 5b - Gaussian Mixture Model.xls
Required Probability and Statistics concepts. Sections 6-12. A quick check.
The main difference between the first and the second part of the course is due to the fact that in the second part we are mainly interested in vectors of returns. This is the obvious point of view of any professional involved in asset pricing, asset management and risk management.
To begin with, we need a compact and reasonably clear notation for dealing with vectors of returns, both from the mathematical and the probabilistic/statistical point of view. For this reason the second part of these handouts opens with two chapters, 6 and 7, dedicated to a quick introduction to the basic notation. You can find something more in the appendixes of these handouts: 18.
Most of the second part of these handouts is centered on the study of the general linear model (section 9) and of its applications in Finance. I expect this topic to be new for most of the class, hence the handouts contain a rather detailed and complete, if simple, introduction to this topic. Another important tool introduced in the following chapters is principal component analysis (section 11), in the context of linear asset pricing models.
Most of what follows is self contained, however some basic concepts and results of Probability and Statistics are required. You can find these in the appendix 19.
Among the most important of these concepts and results, to be added to those already summarized in the first part of these handouts, I would point out: conditional expectation and regressive dependence, point estimation, unbiasedness, efficiency. Again, a short summary of these can be found in 19. Among the most important points see: from 19.43 to 19.47 and from 19.91 to 19.104.
6 Matrix algebra
I suppose the Reader knows what a matrix and a vector are, the basic rules for multiplication between matrices and between matrices and scalars, and for sums of matrices. I also suppose the Reader knows the meaning and basic properties of a matrix inverse and of a quadratic form. This very short section only recalls a small number of matrix results and presents a very useful result called the "spectral decomposition" or "eigendecomposition" theorem. Moreover we consider some differentiation rules.
In what follows I'll write sums and products without declaring matrix dimensions. I'll always suppose the matrices to have the correct dimensions.
The inverse of a square matrix A is indicated by A⁻¹, with A⁻¹A = I = AA⁻¹. A property of the inverse is that, if A and B have inverses, then (AB)⁻¹ = B⁻¹A⁻¹. Notice that, if A and B are square invertible matrices and AB = I then, since (AB)⁻¹ = I = B⁻¹A⁻¹, by multiplying on the left by B and on the right by A we have BB⁻¹A⁻¹A = BA = I.
The rank of a matrix A (no matter if square or not), Rank(A), is the maximum number of linearly independent rows or columns in A. Put in a different way, the rank of a matrix is the order of the biggest (square) matrix that can be obtained from A by deleting rows and/or columns and whose determinant is not zero. Obviously, then, the rank of a matrix cannot be bigger than its smaller dimension.
A fundamental property of the rank of a product is this:

Rank(AB) ≤ min(Rank(A); Rank(B))

If B is a q × k matrix of rank q then Rank(AB) = Rank(A). Analogously, if A is an h × q matrix of rank q then Rank(AB) = Rank(B). Applying this we have that Rank(AA′) = Rank(A).
A symmetric matrix A is called positive semi definite (psd) iff for any column vector
x we have x0 Ax ≥ 0. If the inequality is strong (>) for all the vectors x not identically
null, then the matrix A is called “positive definite” (pd).
Often we must compute derivatives of functions of the kind x0 Ax (a quadratic form)
or x0 q (a linear combination of elements in the vector q with weights x) with respect
to the vector x.
In both cases we are considering a (column) vector of derivatives of a scalar function
w.r.t. a (column) vector of variables (commonly called a “gradient”). There is a useful
matrix notation for such derivatives which, in these two cases, is simply given by:
∂(x′Ax)/∂x = 2Ax

and

∂(x′q)/∂x = q
The proof of these two formulas is quite simple. We give a proof for a generic element k of the derivative column vector.

x′Ax = Σ_i Σ_j x_i x_j a_{i,j}

∂(Σ_i Σ_j x_i x_j a_{i,j})/∂x_k = Σ_{j≠k} x_j a_{k,j} + Σ_{i≠k} x_i a_{i,k} + 2x_k a_{k,k} = Σ_{j≠k} x_j a_{k,j} + Σ_{j≠k} x_j a_{k,j} + 2x_k a_{k,k} = 2A_{k,·} x

where A_{k,·} means the k-th row of A and we used the fact that A is a symmetric matrix. Moreover

x′q = Σ_j x_j q_j

∂(x′q)/∂x_k = q_k
An important point to stress is that the derivative of a function with respect to a
vector always has the same dimension as the vector, so, for instance (remember that
A is symmetric):
∂(x′Ax)/∂x′ = 2x′A
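The two gradient formulas are easy to check numerically; a quick sketch (Python with numpy; the matrix, the vectors and the finite-difference helper are our own test data, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                 # a random symmetric matrix
x = rng.normal(size=4)
q = rng.normal(size=4)

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function of a vector."""
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

grad_quad = num_grad(lambda v: v @ A @ v, x)
assert np.allclose(grad_quad, 2 * A @ x, atol=1e-5)   # d(x'Ax)/dx = 2Ax
grad_lin = num_grad(lambda v: v @ q, x)
assert np.allclose(grad_lin, q, atol=1e-5)            # d(x'q)/dx = q
```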
A multi purpose fundamental result in matrix algebra is the so called “spectral
theorem”:
Theorem 6.1. If A is a (n × n) symmetric, pd matrix then there exist a (n × n) orthonormal matrix X and a diagonal matrix Λ such that:

A = XΛX′ = Σ_j λ_j x_j x_j′
where xj is the j −th column of X and is called the j −th eigenvector of A, the elements
λj on the diagonal of Λ are called the eigenvalues of A. These are positive, if A is pd,
and can always be arranged (rearranging also the corresponding columns of X) in non
increasing order.
The formula A = XΛX 0 is called the “spectral decomposition” of A.
If the matrix A is only psd with rank m < n a similar theorem holds but the matrix X has only m columns and the matrix Λ is a square m × m matrix.
Notice that the spectral theorem implies that the rank of a p(s)d matrix A is equal
to the number of non null eigenvalues of A.
A property of the eigenvectors of a pd matrix A is that XX 0 = I (and, since in the
pd case X is square, we also have X 0 X = I). That is: the eigenvectors are orthonormal.
A nice result directly implied by this theorem when A is pd is this: A−1 = XΛ−1 X 0 .
In fact A−1 = (XΛX 0 )−1 = X 0−1 Λ−1 X −1 = XΛ−1 X 0 .
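A quick numerical illustration of the theorem and of this inverse formula (Python with numpy; the matrix is random test data of ours, and note that numpy's `eigh` returns eigenvalues in ascending rather than non increasing order):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)            # a symmetric pd matrix

lam, X = np.linalg.eigh(A)             # eigenvalues and orthonormal eigenvectors
# Spectral decomposition: A = X Lam X'
assert np.allclose(A, X @ np.diag(lam) @ X.T)
# Sum-of-rank-one form: A = sum_j lam_j x_j x_j'
A_sum = sum(lam[j] * np.outer(X[:, j], X[:, j]) for j in range(5))
assert np.allclose(A, A_sum)
# Orthonormality of the eigenvectors and the inverse: A^{-1} = X Lam^{-1} X'
assert np.allclose(X.T @ X, np.eye(5))
assert np.allclose(np.linalg.inv(A), X @ np.diag(1 / lam) @ X.T)
```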
The Reader must notice that computing the spectral decomposition of a matrix, while straightforward from the numerical point of view, is by no means easy to do by hand. In order to understand this we can post multiply A by the generic xi:

A x_i = XΛX′ x_i = Σ_j λ_j x_j x_j′ x_i

Each term in the sum is going to be equal to 0 (orthonormal xj vectors) except the i-th, which is going to be equal to x_i λ_i, so that x_i solves the equation (A − λ_i I) x_i = 0.
Notice that x_i′x_i = 1, so that the "trivial" solution x_i = 0 is NOT a solution of this problem. Hence, any feasible solution requires |A − λ_i I| = 0, so that we see that the λ_i's are the roots of the equation |A − λI| = 0 (the so called "characteristic equation" for A). If this determinant equation is written down in full it shall be seen that it is a polynomial equation in the variable λ of degree equal to the size k of A. As is well known since high school days, a polynomial equation has k real and/or complex roots (the so called "fundamental theorem of algebra"). However, an explicit formula for finding such roots only exists (in the general case) for k ≤ 4. On the other hand, finding the roots of this equation is such a relevant problem in applied Mathematics that numerical algorithms for computing them have existed at least since the times of Newton.
The representation A = Σ_j λ_j x_j x_j′ makes obvious many classic matrix algebra results. For instance, we know that Az = 0 may have nontrivial solutions only if A is non invertible. In the case of a symmetric psd matrix this implies that the number of positive eigenvalues is smaller than the size of the matrix. In this case, writing Az = Σ_j λ_j x_j x_j′ z = 0 immediately shows that the solution(s) to the homogeneous linear system must be found among those (non null) vectors z which are orthogonal to each eigenvector x_j. Such vectors, obviously, cannot exist if A is invertible^30.
A last useful result is the so called “matrix inversion lemma”
(A − BD−1 C)−1 = A−1 − A−1 B(CA−1 B − D)−1 CA−1
7 Matrix algebra and Statistics
We use both random matrices and random vectors. A random matrix is simply a
matrix of random variables, the same for a random vector.
The expected value of a random matrix or vector Q is simply the matrix or vector
of the expected values of each variable in the matrix or vector and is indicated as E(Q).
^30 For the lovers of formal language: k orthonormal vectors of size k (k-vectors) "span" a k-dimensional space, in the sense that any vector in the space can be written as a linear combination of such orthonormal vectors. For this reason, the only k-vector orthogonal to all k orthonormal vectors (which means that the vector is not a linear combination of them) is the null vector. On the other hand, given q < k orthonormal k-vectors, these span a q-dimensional subspace and there exist other k − q orthonormal k-vectors which are orthogonal to the first q and span the "orthogonal complement" of the space spanned by the q vectors. This is simply the space of all k-vectors which cannot be written as linear combinations of the q vectors, equivalently: the space of all vectors which are orthogonal to the q vectors. You see how a k-dimensional pd matrix implicitly defines a full orthonormal basis for a k-dimensional space. Moreover, the knowledge of its eigenvectors allows us to split this space into orthogonal subspaces.
For a random (column) vector z we define the variance covariance matrix, indicated
with V (z) but sometimes with Cov(z) or C(z) as:
V (z) = E(zz 0 ) − E(z)E(z 0 )
For the expected value of a matrix or a vector we have a result which generalizes
the linearity property of the scalar expected value.
Let A1 and A2 be random matrices (of any dimension, including vectors) and B, C, D, G, F non random matrices. We have:

E(BA1C + DA2G + F) = BE(A1)C + DE(A2)G + F
Where, as anticipated, we suppose that all the products and sums have meaning that
is: the dimensions are correct and the expected values exist.
The covariance matrix has a very important property which generalizes the well
known result about the variance of a sum of random variables.
Let z be a random (column) vector, H a non random matrix and L a non random vector; then:

V(Hz + L) = HV(z)H′

Suppose for instance that z is a 2 × 1 vector and H has a single row made of ones; in this case the above result yields the usual formula for the variance of the sum of 2 random variables.
7.1 Risk budgeting
This result has several applications in Finance, for example, suppose H is given by
a single row that is: we are considering a single linear combination of the random
variables in z. In this case:
V(Hz) = HV(z)H′ = Σ_j H_j Σ_i H_i V(z)_{i,j} = Σ_j H_j C_j

where C_j = Σ_i H_i V(z)_{i,j} (the indexes i, j run over the dimensions of V(z)). In words: C_j is the linear combination, with weights H_i, of all the covariances between z_j and all the z_i (one of these is the covariance of z_j with itself, that is, its variance).
If we interpret z as a vector of (arithmetic) returns and H as a vector of fixed
portfolio weights (for instance the weights for a one period buy and hold portfolio),
the above result expresses the variance of the portfolio as a linear combination of (non
necessarily positive) contributions due to each security return. Each contribution is
the linear combination with weights Hi of all the covariances of the return zj with all
returns or, in other words, the covariance of the return zj with the return Hz of the
portfolio.
Measures like this are commonly computed by portfolio managers as tools for measuring the risk contribution of each security in a portfolio, a practice called "risk budgeting".
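A minimal risk budgeting sketch (Python with numpy; the covariance matrix and the portfolio weights are hypothetical numbers of ours):

```python
import numpy as np

# Hypothetical covariance matrix of three security returns and portfolio weights
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
H = np.array([0.5, 0.3, 0.2])

port_var = H @ V @ H           # V(Hz) = H V(z) H'
C = V @ H                      # C_j = sum_i H_i V_{i,j} = Cov(z_j, Hz)
contrib = H * C                # risk contribution H_j C_j of each security
# The contributions add up exactly to the portfolio variance
assert np.isclose(contrib.sum(), port_var)
```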
7.2 A varcov matrix is at least psd
An important property of a covariance matrix is that it is at least psd. In fact, if A
is the covariance matrix, say, of the vector z, then the quantity x0 Ax is simply the
variance of the linear combination of random variables x0 z where x is a non random
vector. But a variance cannot be negative, hence x0 Ax ≥ 0 for all possible choices of x
and this means A is at least psd.
Suppose now that A is the (suppose pd) covariance matrix of the random vector z. The spectral decomposition theorem tells us that we can write A as A = XΛX′, with X orthogonal (i.e. XX′ = X′X = I) and Λ diagonal (Λ = diag(λ1, ..., λn)). The columns of X and the diagonal elements of Λ are the eigenvectors and eigenvalues of A, respectively (i.e. Axi = λi xi).
Let p = X 0 z. Then p is a vector of non correlated random variables with covariance
matrix Λ. In fact V (p) = X 0 V (z)X = X 0 AX = Λ. Moreover z = Xp, since Xp =
XX 0 z = z.
In other words, we can always write any random vector with pd covariance matrix as a set of linear combinations of uncorrelated random variables. Notice that E(p) = X′E(z) and E(z) = XE(p). When the covariance matrix of z is only psd, a similar result holds but the dimension of the vector p shall be equal to the rank of V(z) and so smaller than the dimension of z.
This in particular implies that any pd matrix can be interpreted as a covariance matrix, that is: there exists a random vector of which the given matrix is the covariance matrix (this is also true for psd matrices, but the proof is slightly more involved).
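A simulation sketch of the decorrelation p = X′z (Python with numpy; the covariance matrix is random test data of ours, and the sample covariance only approximates Λ up to sampling error):

```python
import numpy as np

rng = np.random.default_rng(2)
# A pd covariance matrix A and a large sample of correlated vectors z with V(z) = A
B = rng.normal(size=(3, 3))
A = B @ B.T + np.eye(3)
z = rng.multivariate_normal(np.zeros(3), A, size=200000)

lam, X = np.linalg.eigh(A)
p = z @ X                      # each row is p' = (X'z)', i.e. p = X'z
# The sample covariance of p is close to the diagonal matrix Lambda
S = np.cov(p, rowvar=False)
assert np.allclose(S, np.diag(lam), atol=0.15)
# And z is recovered exactly as z = Xp
assert np.allclose(z, p @ X.T)
```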
Recall that, if A is only psd, say of size k and rank q < k, then it has q eigenvectors x_l corresponding to positive eigenvalues. However, there exist k − q vectors z*_j such that z*_j′ z*_j = 1, z*_j′ z*_h = 0 for j ≠ h and z*_j′ x_l = 0 for any vector x_l which is an eigenvector of A. Using any of these z*_j, or any non zero scalar multiple of them, it is then possible to build linear combinations of the random variables whose varcov matrix is A such that these linear combinations have zero variance.
If A is a variance covariance matrix of linear returns for financial securities, this
implies that it is possible to build a portfolio31 of such securities and, possibly, the risk
free rate, such that the single securities returns are random but the overall position
return is non random.
By no arbitrage, any such position, being risk free, must yield the same risk free
rate (otherwise you can borrow at the lower rate and invest at the higher rate for a
sure, possibly unbounded, profit).
This very important property is central in asset pricing theory and, more in general,
in asset management.
^31 Or, alternatively, a long short position.
(Compare this with the example regarding the constrained minimization of a quadratic
form in the Appendix).
As we shall see in the section dedicated to factor models and principal components, it is often the case that covariance matrices of returns for large sets of securities are approximately not of full rank, that is: it may be that all eigenvalues are non zero but many of these are almost zero. In this case it is possible to build portfolios of risky securities whose return is "almost" riskless. This has important applications in (hedge) fund management and, more in general, in trading and asset pricing.
7.3 Note
These are the barely essential matrix results for this course. Many more useful results
of matrix algebra exist, both in general and applied to Statistics and Econometrics.
For the interested student the Internet offers a number of useful resources.
We limit ourselves to quoting a matrix "cookbook" you can download from the Internet; the title is "The Matrix Cookbook"^32.
Examples
Exercise 6-Matrix Algebra.xls
8 The deFinetti, Markowitz and Roy model for asset allocation
This section on the deFinetti-Markowitz-Roy model33 is here both as an exercise in
matrix calculus and because it shall be useful to us in what follows.
Suppose one investor is considering a buy and hold portfolio strategy for a single
period of any fixed length (the length is not a decision variable in the asset allocation)
from time t to time T . Let us indicate with R the random vector of linear total returns
from a given set of (k) stocks for the time period. We suppose the investor knows (not
“estimates”) both µR = E(R) and Σ = V (R). We suppose Σ to be invertible.
Moreover, there exists a security which can be bought at t and whose price at T is
known at t (typically a non defaultable bond) called “risk free security”. Let rf be the
(linear) non random return from this investment over the period.
^32 A possible link is: http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf. This worked when I last checked it in August 2019, but I cannot guarantee the stability of the link.
^33 See the appendix 15 for a summary of the story of this result.
The fund manager’s strategy is to invest in the risk free security and in the stocks
at time t and then liquidate the investment at time T . The relative amounts of each
investment in stocks are in the column vector w, while 1 − w0 1 is the relative amount
invested in the risk free security.
The return of the portfolio between t and T is:
RΠ = (1 − w0 1)rf + w0 R
So that the expected value and the variance of the portfolio return are
E(RΠ ) = (1 − w0 1)rf + w0 µR
and
V (RΠ ) = w0 Σw
The problem for the fund manager is to choose w so that, for a given expected value
c of the portfolio return, the variance of the portfolio return is minimized. In formulas
min_{w: (1−w′1)rf + w′µR = c} w′Σw
Equivalently the fund manager could fix the variance and choose w such that the
expected return is maximized.
In both problems it would be sensible to use an inequality constraint. For instance, in the first problem, we could look for

min_{w: (1−w′1)rf + w′µR ≥ c} w′Σw

We choose the = version just to allow direct use of Lagrange multipliers, as we'll see in what follows.
Notice that we do not assume the sum of the elements of w to be 1. This shall be true only if no risk free investment is made. However, obviously, if we complement the vector w with the fraction of the portfolio invested in the risk free security, the sum of all the portfolio fractions is 1. Moreover we do not require each element of w to be positive. This can be done, but not in the straightforward way we are going to follow^34.
^34 Before going on with the solution of our problem, it is proper to discuss an interesting property of the mean variance criterion. The mean variance criterion may seem sensible and, actually, it usually is sensible. However it is easy to build examples where the results are counter intuitive. Suppose only two possible scenarios exist, both with probability 1/2. You are choosing between two securities: A and B. In the first state of the world the return of both securities is 0, in the second it is 1 for the first and 4 for the second. So the expected returns are .5 and 2 and the variances .25 and 4. Suppose now you wish to minimize the variance for getting at least an expected return of .5. Both
In order to solve the problem we consider its Lagrangian function (notice the rearranged constraint):
L = w0 Σw − 2λ(rf − c + w0 (µR − 1rf ))
And differentiate this with respect to the vector w and the scalar λ (remember the
differentiation rules).
∂L/∂w = 2Σw − 2λ(µR − 1rf)

∂L/∂λ = −2(rf − c + w′(µR − 1rf))
We then define the system of "normal equations" as:

Σw − λ(µR − 1rf) = 0
rf − c + w′(µR − 1rf) = 0
It is easy to solve the first sub system for w as:
w = λΣ−1 (µR − 1rf )
At this point we already notice that the required solution is a scalar multiple (λ) of a
vector which does not depend on c. In other words the relative weights of the stocks
in the portfolio are already known and do not depend on c. What is still not known is
the relative weight of the portfolio of stocks with respect to the risk free security.
This is a first instance of a "separation theorem": the amount of expected return we want to achieve only influences the allocation between the risk free security and the stock portfolio but does not influence the allocation among different stocks (the optimal risky portfolio is uniquely determined).
As a second comment we see that, had our objective been that of solving

max_w (1 − w′1)rf + w′µR − (1/(2λ)) w′Σw

that is: had we wished to maximize some "mean variance" utility criterion, our result would have been exactly the same. Since the (negative) weight of the variance in this criterion is given by −1/(2λ), as a rule 1/λ is termed the "risk aversion parameter".
investments yield at least that expected return; however, according to the mean variance criterion, you would choose the first, as the variance of the second is bigger. But the second is going to yield you a return greater than or equal to the first in both states of the world; in other words you are trading more for less. Notice that, since the two investments are perfectly correlated, you are going to invest in just one of them. The reason for the paradox is simple: you are considering "variance" as bad, but variance may be due both to bigger losses and to bigger gains. Since with the mean variance criterion you may end up trading more for less we can conclude that, in general, the mean variance criterion does not satisfy no arbitrage.
The rest of this section is useful as an exercise in matrix algebra
(and maybe in Finance) but is not required for the exam
If we see λ as the Lagrange multiplier, it is possible, and useful, to express it in more
detail. Substitute the solution for w in the constraint equation and find:
λ = (c − rf) / ((µR − 1rf)′Σ⁻¹(µR − 1rf))

For the weights w, we have:

w = (c − rf) Σ⁻¹(µR − 1rf) / ((µR − 1rf)′Σ⁻¹(µR − 1rf))
As an aid to intuition, consider what happens in the case Σ is diagonal with elements σ²_{Ri}. In this case the stock portfolio weights are proportional to (µ_{Ri} − rf)/σ²_{Ri}.
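A numerical sketch of the whole solution (Python with numpy; the expected returns, covariance matrix, risk free rate and target c are hypothetical numbers of ours):

```python
import numpy as np

mu = np.array([0.08, 0.10, 0.12])                 # hypothetical expected returns
Sigma = np.array([[0.04, 0.01, 0.01],
                  [0.01, 0.09, 0.03],
                  [0.01, 0.03, 0.16]])            # hypothetical covariance matrix
rf, c = 0.02, 0.07                                # risk free rate, target expected return

ones = np.ones(3)
excess = mu - rf * ones
q = excess @ np.linalg.solve(Sigma, excess)       # q = (mu - 1 rf)' Sigma^-1 (mu - 1 rf)
lam = (c - rf) / q                                # Lagrange multiplier
w = lam * np.linalg.solve(Sigma, excess)          # optimal stock weights
w_rf = 1.0 - w.sum()                              # fraction in the risk free security

# The constraint holds, and the Sharpe ratio equals sqrt(q) whatever the choice of c
assert np.isclose(w_rf * rf + w @ mu, c)
sharpe = (c - rf) / np.sqrt(w @ Sigma @ w)
assert np.isclose(sharpe, np.sqrt(q))

# Tangency portfolio: rescale Sigma^-1 (mu - 1 rf) so the weights sum to one
w_tan = np.linalg.solve(Sigma, excess)
w_tan = w_tan / w_tan.sum()
```

Changing c rescales w but not the relative weights w_tan, which is the separation result in action.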
It is quite useful to write this as λ = (c − rf)/q where q = (µR − 1rf)′Σ⁻¹(µR − 1rf). Notice that q does not depend on c, that is: q is independent of which optimal portfolio (choice of expected value) you are building.
In fact, if w is the solution to our problem, we have

q = V(w′R/λ)

So that √q represents the standard deviation of the solution portfolio divided by λ or, in other words, the "risk" of our optimal portfolio expressed in "units of λ".
Since

λ = (c − rf)/q = (E(RΠ) − rf)/q and V(RΠ) = λ²V(w′R/λ) = λ²q

We also have

λ = V(RΠ)/(E(RΠ) − rf)

So that, from q = (c − rf)/λ = (E(RΠ) − rf)²/V(RΠ), we have

√q = (E(RΠ) − rf)/√V(RΠ)
As we saw before, q is a constant identical for all efficient portfolios, that is: q = (µR − 1rf)′Σ⁻¹(µR − 1rf). This last equation implies that, in the expected value standard deviation space, efficient portfolios' expected values and standard deviations are connected by a line whose slope is √q. In other words √q is both the so called "Sharpe ratio" and the so called "price of risk" of the efficient portfolio (we obviously suppose E(RΠ) − rf ≥ 0)^35.

^35 It should be noticed, however, that this "price of risk" interpretation is a bit extraneous to the mean variance context and shall take all its weight in a CAPM context. But this is not a topic of our course.
Suppose now that we choose a c such that the corresponding w̃ satisfies 1′w̃ = 1, that is: no investment in the risk free security. In other words we are considering the so called tangency portfolio. Since in the CAPM model the tangency portfolio becomes the market portfolio we call the corresponding c = E(RM). In this case we have:

1′w̃ = (E(RM) − rf) 1′Σ⁻¹(µR − 1rf)/qM = 1

so that

qM = (E(RM) − rf) 1′Σ⁻¹(µR − 1rf)

or, in other terms:

w̃ = Σ⁻¹(µR − 1rf) / (1′Σ⁻¹(µR − 1rf))
The tangency portfolio is the portfolio where a line starting at the risk free rate, in
the expected value standard deviation plane, is tangent to the efficient frontier. Notice
that the weights of the (relevant, that is: possibly tangent) efficient frontier portfolios
are simply given by w̃ changing rf .
Due to the separation theorem, we can express the return of any efficient portfolio
(RΠ ) as the return of a portfolio invested in the risk free rate and in the tangency
portfolio:
RΠ = (1 − γ)rf + γ w̃0 R
where:
γ = (c − rf )10 Σ−1 (µR − 1rf )/q
Remember that c = E(RΠ ) so that:
γ = (E(RΠ ) − rf )10 Σ−1 (µR − 1rf )/q
A comment on Markowitz and the CAPM.
The Markowitz model is not the CAPM: it only gives us all mean variance efficient combinations of a risk free security and a given set of risky securities with given expected value vector and covariance matrix. For a given set of risky securities and a given risk free rate all efficient portfolios have the same Sharpe ratio, as they are different linear combinations of the risk free security and of the same risky portfolio. However, if we change the set of securities the efficient risky portfolio changes, in general, and the corresponding Sharpe ratio changes too.
However, suppose the set of risky securities contains ALL risky securities. It could
be difficult, in this case, not to state that there exists one and only one (if the covariance
matrix is invertible) optimal risky portfolio and, by consequence, only one possible
Sharpe ratio for all possible (efficient) investments. Here we are very near to the
CAPM and we are also very near to understanding that the use of a Sharpe ratio for comparing securities is, perhaps, unwarranted (Sharpe would say something much stronger).
A last relevant observation is this: in this chapter we suppose expected values and
covariances as given and known. What happens when (as is commonly the case) this
is not true?
9 Linear regression

9.1 Weak OLS hypotheses
We begin with the so called “weak” OLS hypotheses.
Let Y be an n × 1 random vector containing a sample of n observations on a "dependent" variable y. Let X be a non random n × k matrix of rank k, β a k × 1 non random vector and ε an n × 1 random vector.
Let:

Y = Xβ + ε

and suppose E(ε) = 0 and V(ε) = σ²_ε I_n.
These hypotheses are best understood in a statistical (that is: estimation in repeated samples) setting. Each sample we are going to draw shall be given by a realization of Y. In each sample X (observable) and β (unobservable) shall be the same. What makes Y "random", that is: changing in a (partially) unpredictable (for us) way from sample to sample, is the random "innovation" or "error" vector ε which we cannot observe.
As for the "partially" clause: under the assumed hypotheses it is clear that

E(Y) = E(Y|X) = Xβ

So that the expected value of the random Y is not random (while it is unknown, as β is unknown).
In this sense we may say that we are modeling the regression function of Y on X, that is: the conditional expectation of Y given X. However, since the matrix X is non random, which means that the probability of observing that particular X is one or, equivalently as observed above, that in any sample X (and β) shall always be the same (so that Y is random just due to the effect of the random element ε, in fact: V(Y) = V(Xβ + ε) = V(ε) = σ²_ε I_n), the conditional expectation shall be the same as the unconditional expectation.
A more interesting model, from the point of view of financial applications, shall be
considered below when we shall allow X to be itself random.
9.2 The OLS estimate
The problem is to estimate β. Under the above hypotheses different estimation procedures all lead to the same estimate: the Ordinary Least Squares estimate β̂OLS.
The simplest way for deriving β̂OLS is through its namesake, that is: find the value β̂OLS that minimizes ε′ε, the sum of squared errors.
The objective function shall be:

ε′ε = (Y − Xβ)′(Y − Xβ)
The first derivatives with respect to β are:

∂(Y − Xβ)′(Y − Xβ)/∂β = 2X′Xβ − 2X′Y
The system of normal36 equations (where we ask for the β that sets to zero the first
derivatives) is:
X 0 Xβ = X 0 Y
Since the rank of X is k, X 0 X can be inverted and the (unique) solution of the system
is:
βbOLS = (X 0 X)−1 X 0 Y
As usual we do not check the second order conditions (we should!). Informally, we see that we are minimizing a sum of squares which may go to plus infinity, not to minus infinity, so that our stationary point should be a minimum, not a maximum (this is by no means rigorous but it could be made so).
It is easy to show that β̂OLS is unbiased for β. In fact:

E(β̂OLS) = (X'X)⁻¹X'E(Y) = (X'X)⁻¹X'Xβ = β

where in the first passage we use the fact that X is non random and in the second we use the hypothesis that β is non random and that E(ε) = 0.
It is also easy to compute V(β̂OLS):

V(β̂OLS) = (X'X)⁻¹X'V(Y)X(X'X)⁻¹ = (X'X)⁻¹X'σ²εI_nX(X'X)⁻¹ = σ²ε(X'X)⁻¹X'X(X'X)⁻¹ = σ²ε(X'X)⁻¹.
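The derivation above is easy to check numerically. The sketch below (simulated data; all variable names and sizes are our own choices, not part of the text) solves the normal equations X'Xβ = X'Y directly and compares the result with a generic least squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # first column of ones
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(scale=0.3, size=n)        # E(eps) = 0, V(eps) = sigma^2 I
Y = X @ beta + eps

# Solve the normal equations X'X b = X'Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Same answer from a generic least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_ols, beta_lstsq))  # True
```

Since rank(X) = k, both routes give the same unique solution.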
³⁶ Here, as in the Markowitz model, the term “normal” does not mean “usual” or “standard”. As we are going to see in a moment, the first order conditions in systems like these require that each of a set of products between vectors be 0. This has to do with requiring that vectors be perpendicular, and the term “normal” derives from the Latin “normalis”, meaning “done according to a carpenter’s square” (“norma” in Latin). A carpenter’s square, as is well known, is made of two rulers crossed at 90° (the triangular square of high school has an added side). By extension the term came to mean “done according to rules” (in effect a square is two rule(r)s...) and from this the most common, non mathematical, usage of the word today.
9.3 The Gauss Markoff theorem
All this notwithstanding, the choice of β̂OLS as an estimate of β based only on the minimization of ε'ε could be disputed: in what sense should this be a “good” estimate from a statistical point of view?
A strong result in favor of β̂OLS as a good estimate of β is the Gauss-Markoff theorem.
Definition 9.1. If β∗ and β∗∗ are both unbiased estimates of β we say that β∗ is not worse than β∗∗ iff D = V(β∗∗) − V(β∗) is at least a psd matrix.
Notice that in the univariate case this definition boils down to the standard definition. Moreover, suppose that what you really want is to estimate a set of linear functions of β, say Hβ, where H is any nonrandom matrix (of the right size so that it can premultiply β). Suppose you know that β∗ is better than β∗∗ according to the previous definition. In this case it is easy to show that Hβ∗ is better than Hβ∗∗, according to the same definition, as an estimate of Hβ. In fact:

V(Hβ∗∗) − V(Hβ∗) = H(V(β∗∗) − V(β∗))H' = HDH'

And if D is at least psd then HDH' is at least psd (why?).
This “invariance to linear transforms” property is the strongest argument in favor of this definition of “not worse” estimator.
As an exercise, find a proof of the fact that all the elements on the diagonal of D must be non negative. In other terms: none of the variances of the ∗ estimate is bigger than the corresponding variance of the ∗∗ estimate.
Obviously this definition of “not worse” would amount to little if the estimates were allowed to be biased. In that case any vector of constants would be not worse, in terms of variance, than any other estimate. This is why we require unbiasedness.³⁷
We are now just one step short of being able to state an important result in OLS theory. We would like to prove a theorem of the kind: the best unbiased estimate of β is β̂OLS. Alas, this is not true in this generality. The theorem turns out to be true if we further require the class of competing estimates to be linear in Y, that is: each competing estimate must be of the form HY with H a known nonrandom matrix.
Definition 9.2. We say that β̂ is a linear estimate of β iff β̂ = HY where H is a non
random matrix.
We thus arrive at the celebrated Gauss Markoff theorem³⁸.
Theorem 9.3. Under the weak OLS hypotheses β̂OLS is the Best Linear Unbiased Estimate (BLUE) of β.
³⁷ As an alternative we could use the concept of mean square error matrix in place of the variance covariance matrix.
³⁸ See section 16 for some details about the history of this theorem.
Proof. Any linear estimate of β can be written as β̂ = ((X'X)⁻¹X' + C)Y with an arbitrary C. Since the estimate must be unbiased we have:

E(β̂) = ((X'X)⁻¹X' + C)Xβ = β + CXβ = β

and this is possible only if CX = 0. Let us now compute V(β̂):

V(β̂) = σ²ε((X'X)⁻¹X' + C)((X'X)⁻¹X' + C)' = σ²ε((X'X)⁻¹ + CC' + (X'X)⁻¹X'C' + CX(X'X)⁻¹)

but, since CX = 0, the last two terms in the above expression are both equal to 0. In the end we have:

V(β̂) = σ²ε((X'X)⁻¹ + CC')
and this is V(β̂OLS) plus σ²εCC', which is an at least psd matrix (why?). We have shown that the covariance matrix of any linear unbiased estimate of β can be written as the covariance matrix of β̂OLS plus a matrix which is at least psd. To summarize: we have shown that β̂OLS is BLUE.
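The inequality at the heart of the proof can be illustrated numerically: any competing linear unbiased estimate (X'X)⁻¹X' + C with CX = 0 has a covariance matrix exceeding that of OLS by a psd term. A minimal sketch (our own construction of a valid C, using the fact that the annihilator matrix M = I − X(X'X)⁻¹X' satisfies MX = 0):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T     # annihilator: M X = 0
C = rng.normal(size=(k, n)) @ M       # any C of this form satisfies C X = 0
A_ols = XtX_inv @ X.T                 # OLS weighting matrix
A_alt = A_ols + C                     # another linear unbiased estimate

# Exact covariance matrices, since V(Y) = sigma^2 I implies V(AY) = sigma^2 A A'
V_ols = sigma**2 * A_ols @ A_ols.T
V_alt = sigma**2 * A_alt @ A_alt.T

# The difference must be positive semi-definite: all eigenvalues >= 0
eigs = np.linalg.eigvalsh(V_alt - V_ols)
print(eigs.min() >= -1e-10)  # True
```

The difference equals σ²CC' exactly, since the cross terms vanish when CX = 0.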
As we shall see in the “stochastic X” section, the Gauss-Markoff theorem still holds
under suitable modifications of the weak OLS hypotheses in the case of stochastic X.
There is an equivalent theorem which is valid when V(ε) = Σ, with Σ any non random (pd) matrix known up to a multiplicative constant. In this case the BLUE estimate is β̂GLS = (X'Σ⁻¹X)⁻¹X'Σ⁻¹Y, where GLS stands for Generalized Least Squares, but we are not going to use this in this course. Notice that if Σ is not known up to a multiplicative constant the above is not an estimate.
The proof begins by recalling that any pd matrix has a pd inverse. Moreover any pd matrix A can be written as A = PP' with P invertible. We then have Σ = PP' and Σ⁻¹ = (PP')⁻¹ = P'⁻¹P⁻¹. Now, multiply the model Y = Xβ + ε by P⁻¹, as P⁻¹Y = P⁻¹Xβ + P⁻¹ε. Call now Y∗ = P⁻¹Y; X∗ = P⁻¹X and ε∗ = P⁻¹ε.
We have that Y∗ = X∗β + ε∗ satisfies the weak OLS hypotheses and, in particular,

V(ε∗) = V(P⁻¹ε) = P⁻¹ΣP'⁻¹ = P⁻¹PP'P'⁻¹ = I.

We can then follow the standard proof up to the result: (X∗'X∗)⁻¹X∗'Y∗ is BLUE. In the original data this is equal to (X'P'⁻¹P⁻¹X)⁻¹X'P'⁻¹P⁻¹Y = (X'Σ⁻¹X)⁻¹X'Σ⁻¹Y = β̂GLS.
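The whitening argument can be checked directly: transforming by P⁻¹ and running OLS on the transformed data reproduces the GLS formula. A sketch under an assumed known Σ (the AR(1)-like correlation structure is our arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.normal(size=n)

# A known positive definite Sigma (illustrative: exponentially decaying correlation)
idx = np.arange(n)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])

# Direct GLS formula: (X' Sigma^-1 X)^-1 X' Sigma^-1 Y
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)

# Whitening route: Sigma = P P' (Cholesky), then OLS on P^-1 Y against P^-1 X
P = np.linalg.cholesky(Sigma)
Xs = np.linalg.solve(P, X)
Ys = np.linalg.solve(P, Y)
beta_white = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)

print(np.allclose(beta_gls, beta_white))  # True
```

The Cholesky factor plays the role of P in the proof: it exists for any pd Σ and is invertible.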
The result seems very general and, in a sense, it is. However we should take into account that the above proof assumes a Σ that is non diagonal but KNOWN. Otherwise we could not compute P and the estimate.
Most cases of a linear model with correlated residuals do not allow for a “known” Σ, and the GLS estimate cannot be used directly. Most estimates used in practice in these cases can be seen as versions of the GLS estimate where Σ is itself “estimated” in some way.
We do not consider this (interesting and very relevant) topic in this introductory
course.
The above proof deserves some further consideration. Under the standard weak OLS hypotheses, both with non stochastic and with stochastic X, the OLS estimate works fantastically well: it minimizes the sum of squared errors, it is unbiased, it is BLUE. This is perhaps too much, and we should surmise that some of this bonanza strictly depends on a clever choice of the hypotheses.
This is exactly the case. We just proved that, even with non stochastic X, when the covariance matrix of the residuals is NOT σ²εI, the BLUE estimate is NOT the OLS estimate. In this case “minimizing the sum of squared errors” is not equivalent to finding the “best” estimate in the Gauss Markoff sense.
9.4 Fit and errors of fit
We call Ŷ = Xβ̂OLS the “fitted” values of Y. In fact Ŷ is to be understood as an estimate of E(Y) = Xβ. On the other hand, ε̂ = Y − Ŷ bears the name of “errors of fit”.
Notice that, by the first order conditions of least squares, we have:

X'Xβ̂OLS = X'Y
X'(Xβ̂OLS − Y) = 0
X'ε̂ = 0

This in particular implies

β̂'OLS X'ε̂ = Ŷ'ε̂ = 0

This result is independent of the OLS hypotheses and depends only on the fact that β̂OLS minimizes the sum of squared errors.
9.5 R²
A useful consequence of this result, joint with the assumption that the first column of X is a column of ones, allows us to define an index of “goodness of fit” (read: how much did I minimize the squares?) called R².
In fact:

Y'Y = (Ŷ + ε̂)'(Ŷ + ε̂) = Ŷ'Ŷ + ε̂'ε̂

where the last equality comes from the fact that Ŷ'ε̂ = 0.
Moreover, if X contains as first column a column of ones, X'ε̂ = 0 implies that the sum, hence the arithmetic average, of the vector ε̂ is equal to zero. So the mean of Ŷ equals the mean of Y:

Ȳ = mean(Ŷ) + mean(ε̂) = mean(Ŷ)

and:

Y'Y − nȲ² = Ŷ'Ŷ − nȲ² + ε̂'ε̂

where n is the length of the vectors (number of observations). In other words, indicating with Var(Y) the numerical variance of the vector Y (that is: the mean of the squares minus the squared mean), we have:

Var(Y) = Var(Ŷ) + Var(ε̂)
We see that the variance of Y neatly decomposes into two non negative parts. There is no covariance! This is totally peculiar to the use of least squares and implies the definition of a very natural measure of “goodness of fit”:

R² = Var(Ŷ)/Var(Y) = 1 − Var(ε̂)/Var(Y)
Notice that, in order to be meaningful, R² requires the presence of a column of ones (or of any constant, in fact) in X. Otherwise the mean of the errors of fit may be different from 0 and the passage from sums of squares to variances is no longer possible (the mean of Ŷ shall not, in general, be the same as the mean of Y).
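The exact decomposition Var(Y) = Var(Ŷ) + Var(ε̂), and hence the two equivalent expressions for R², can be verified numerically. A sketch with simulated data (intercept included, as the text requires):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # column of ones included
Y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e_hat = Y - Y_hat

# With an intercept the errors of fit have zero mean...
print(abs(e_hat.mean()) < 1e-10)                     # True
# ...and the variance of Y splits with no covariance term
var = lambda v: ((v - v.mean()) ** 2).mean()
print(np.isclose(var(Y), var(Y_hat) + var(e_hat)))   # True

R2_a = var(Y_hat) / var(Y)
R2_b = 1 - var(e_hat) / var(Y)
print(np.isclose(R2_a, R2_b))                        # True
```

Dropping the column of ones breaks the first print, and with it the equality of the two R² formulas.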
Due to sampling error we may expect to observe a positive R² even when there is no “regression” between Y and X. We can also say something about the expected size of an R² when it should actually be 0. In fact, suppose that, conditionally on an n × k matrix X of regressors, Y is a vector of n iid random variables of variance σ², so that the regression of Y on X should yield an R² of 0. However, the expected value of the sampling variance of the elements of Y (that is: the denominator of the R²) is σ², while we can show that the expected value of the sampling variance of the elements of the error of fit vector ε̂ is σ²(n−k)/n, so that the expected value of the sampling variance of Ŷ in the same regression is going to be σ²k/n (because Y = Ŷ + ε̂), and this is the numerator of the R².
While we know that the expected value of a ratio is not the ratio of the expected values, this could still be a good approximation, so we can say that, in the case of no regression at all between Y and X (that is: theoretical value of R² equal to 0), the expected value of the R² is approximated by (σ²k/n)/σ² = k/n, and this number could be quite big if you use many variables and do not have many observations.
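The k/n approximation is easy to see by simulation. A sketch (sample sizes and replication count are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, reps = 50, 10, 2000
r2 = np.empty(reps)
for i in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    Y = rng.normal(size=n)                # independent of X: true R^2 is 0
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b
    r2[i] = 1 - e.var() / Y.var()
print(r2.mean())  # roughly k/n = 0.2 even though the true R^2 is 0
```

With k = 10 regressors and only n = 50 observations, the purely spurious R² averages close to 0.2, as the approximation predicts.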
This simple fact should make us wary of using regression as an “exploratory” tool for finding the “most relevant variables” in a wide set of potential candidates. Such an attitude has been common, and was successfully criticized, many times in the past and, today, is back in fashion within the “data mining” movement. “To be wary” does not mean “to utterly avoid”: taken with care, such procedures and, more in general, exploratory data analysis may be useful.
9.6 More properties of Ŷ and ε̂
Let us now study some properties of Ŷ and ε̂ depending on the OLS hypotheses. First we compute the expected values and covariance matrices of both vectors.

E(ε̂) = E(Y − Xβ̂OLS) = Xβ − Xβ = 0

V(ε̂) = V(Y − Xβ̂OLS) = V(Y − X(X'X)⁻¹X'Y) = V((I − X(X'X)⁻¹X')Y) =
= σ²ε(I − X(X'X)⁻¹X')(I − X(X'X)⁻¹X') = σ²ε(I − X(X'X)⁻¹X')

this because, by direct computation, we see that

(I − X(X'X)⁻¹X')(I − X(X'X)⁻¹X') = (I − X(X'X)⁻¹X')

that is: (I − X(X'X)⁻¹X') is an idempotent matrix.
With similar passages we find that:

E(Ŷ) = Xβ
V(Ŷ) = σ²εX(X'X)⁻¹X'
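The idempotency used above, and the resulting non-diagonal covariance matrix of ε̂, can be checked directly (a sketch with an arbitrary simulated X):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # "hat" matrix: Y_hat = H Y
M = np.eye(n) - H                      # residual maker: e_hat = M Y

# Both are idempotent (and symmetric)
print(np.allclose(M @ M, M), np.allclose(H @ H, H))  # True True

# V(e_hat) = sigma^2 M is not diagonal: off-diagonal entries are non-zero,
# so the errors of fit are correlated even though the errors are not
off = M - np.diag(np.diag(M))
print(np.abs(off).max() > 0)  # True
```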
In summary: we see that Ŷ is indeed an unbiased estimate of E(Y). On the other hand, we see that ε̂ has a non diagonal covariance matrix even if (or, better, just because) the vector ε is made of uncorrelated errors.
This property of the estimated residuals is, in some sense, unsatisfactory and led some researchers to define a different estimate of the residuals (non OLS based) with the property of being uncorrelated under the OLS hypotheses. This different estimate, which we do not discuss here, is known in the literature as BLUES residuals (where the ending S stands for “scalar”, that is: with diagonal covariance matrix).
9.7 Strong OLS hypotheses and testing linear hypotheses in the linear model
This is a very short section for a very relevant topic. We are not going to deal with
general testing of linear hypotheses but only with those tests which are routinely found
in the output of standard OLS regression computer programs.
Let us begin by introducing the strong OLS hypotheses. In short, these are the same as the weak OLS hypotheses (with non stochastic X) plus the hypothesis that the vector ε is not only made of uncorrelated, zero expected value and constant variance random variables, but is also distributed as an n dimensional Gaussian density with expectation vector made of zeros and diagonal variance covariance matrix.³⁹
Why this hypothesis? When we wish to test hypotheses we need to find distributions of sample functions; for instance, we are going to need the distribution of β̂OLS. Up to now we know that under the weak OLS hypotheses β̂OLS has expected value vector β and variance covariance matrix σ²ε(X'X)⁻¹. Moreover we know that β̂OLS = β + (X'X)⁻¹X'ε, that is: it is a linear function of ε (and β and X are non stochastic). With the added hypothesis we can then conclude that β̂OLS has a Gaussian distribution with expected value vector β and variance covariance matrix σ²ε(X'X)⁻¹.
³⁹ A k dimensional random vector z̃ is distributed according to a k dimensional Gaussian density with expected values vector µ and variance covariance matrix Σ if and only if the density at any vector z of possible values for z̃ is given by

f(z; µ, Σ) = (2π)^(−k/2) |Σ|^(−1/2) exp(−(1/2)(z − µ)'Σ⁻¹(z − µ))

(As usual in the text, we shall omit tildes for distinguishing between RVs and their values when this should not create confusion.)
If we remember the properties of determinants and inverses of diagonal matrices, we see from this formula that, in the case of a diagonal covariance matrix, this density becomes the product of k one dimensional Gaussian densities (one for each element of the vector z̃). So, in the Gaussian case, non correlation and independence are the same. In fact, if Σ is a diagonal matrix with diagonal terms σᵢ², we have |Σ| = ∏ᵢ σᵢ² and Σ⁻¹ = diag(1/σ₁², ..., 1/σ_k²), so that

f(z; µ, Σ) = (2π)^(−k/2) ∏ᵢ (1/σᵢ) exp(−(1/2) Σᵢ (zᵢ − µᵢ)²/σᵢ²) = ∏ᵢ (2πσᵢ²)^(−1/2) exp(−(1/2)(zᵢ − µᵢ)²/σᵢ²) = ∏ᵢ f(zᵢ; µᵢ, σᵢ²)

In words: with a diagonal covariance matrix the joint density is the product of the marginal densities, which is the definition of independence.
An important property of the k dimensional Gaussian distribution is that, if A and B are non stochastic matrices (of dimensions such that A + Bz̃ is meaningful), then the distribution of A + Bz̃ is Gaussian with expected vector A + Bµ and variance covariance matrix BΣB'. Linear transforms of Gaussian random vectors are Gaussian random vectors. This, for instance, implies that, if z̃ is a Gaussian random vector, then each z̃ᵢ is Gaussian, as we stated a moment ago in the proof of equivalence between non correlation and independence for the Gaussian distribution. This is easy to see: just write z̃ᵢ = 1ᵢ'z̃, where 1ᵢ' is a k dimensional row vector with null elements except a 1 in the i-th place, and apply the linearity property.
This may seem too expedient: OK, computations are now simple, but why Gaussian errors? In fact it often is too expedient, and the pros and cons of the hypothesis could be (and are) discussed at length. For the moment we shall take it as standard beginner's practice in the econometric world we live in, a practice to be used with much care.
We suppose everybody knows what a Statistical Hypothesis is (see Appendix in
case). We now define a “linear” statistical hypothesis.
A linear hypothesis on β can be written as Rβ = c, where R is a matrix of known constants and c a vector of known constants. For the purpose of this summary we shall concentrate on two particular choices of R: a 1 × k vector R where only the j-th element is 1 and the others are zeros, and a (k − 1) × k matrix where the first column is made of zeros and the remaining (k − 1) × (k − 1) square matrix is an identity. In both cases c is made of zeros (in the first case a single 0 and in the second a k − 1 vector of zeros).
The first kind of hypothesis is simply that the j-th β is zero (while all other parameters are free to take any value); the second kind of hypothesis is that all parameters are jointly zero (with the possible exception of the intercept). For (non trivial) historical reasons these hypotheses are considered so frequently relevant that any program for OLS regression tests them. Whether these hypotheses are of interest to you is for you to evaluate.
I shall not detail the procedure for testing the hypothesis that all parameters, except, possibly, the intercept, are jointly equal to zero. I only mention the fact that the result of this test is usually displayed in any OLS program output. The name of this test is the F test.
A little more detail on the univariate test.
The standard test for the hypothesis H0: βj = 0 against H1: βj ≠ 0 (complete the hypotheses by yourself) requires the distribution of β̂OLS, and this, as we wrote above, requires a strengthening of the OLS hypotheses which takes the form of the assumption that ε is distributed according to an n-variate Gaussian: ε ∼ N_n(0, σ²εI_n).
We do not discuss here the reasons for and against this hypothesis.
Under this hypothesis, as seen above, we can show that β̂OLS ∼ N_k(β, σ²ε(X'X)⁻¹).
Hence the ratio:
β̂j − βj
p
σ2 {(X 0 X)−1 }jj
(I drop the subscript OLS from the estimate in order to avoid double subscript problems) is distributed according to a standard Gaussian (i.e. N1 (0, 1)).
Suppose now we set βj = 0 in the above ratio. In this case the distribution of the
ratio shall be a standard Gaussian only if H0 : βj = 0 is true. This allows us to define
a reasonable rejection region for our test.
Reject H0: βj = 0 with a size of the error of the first kind equal to α iff:

β̂j / √(σ²ε{(X'X)⁻¹}jj) ∉ [−Z_{1−α/2}; +Z_{1−α/2}]
(which can be written in many equivalent forms).
In the above formula {(X'X)⁻¹}jj is the j-th element on the diagonal of (X'X)⁻¹, and Z_{1−α/2} is the quantile of the standard Gaussian which leaves on its left a probability of 1 − α/2.
This solves the problem if σ²ε is known. In case it is unknown, estimate it with:

σ̂²ε = ε̂'ε̂ / (n − k)
and use as critical region:

β̂j / √(σ̂²ε{(X'X)⁻¹}jj) ∉ [−t_{n−k,1−α/2}; +t_{n−k,1−α/2}]

where t_{n−k,1−α/2} is the quantile of a T distribution with n − k degrees of freedom which leaves on its left a probability equal to 1 − α/2.
The use of the T distribution is the reason for the name given to this test: the T
test.
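The steps just described (estimate σ²ε, form the ratio, compare with the quantile) can be sketched as follows. The critical value 1.98 for 97 degrees of freedom at α = 5% is hard-coded from tables to keep the sketch dependency-free; the data and coefficient values are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)   # true beta_3 = 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e_hat = Y - X @ beta_hat
s2 = e_hat @ e_hat / (n - k)          # unbiased estimate of sigma^2_eps

# T statistic for H0: beta_j = 0, one per coefficient
se = np.sqrt(s2 * np.diag(XtX_inv))
t_stat = beta_hat / se

# Quantile t_{n-k, 1-alpha/2} with n-k = 97 d.o.f. and alpha = 5% (from tables)
t_crit = 1.98
reject = np.abs(t_stat) > t_crit
print(reject)  # the first two coefficients should be rejected as zero
```

The third coefficient, whose true value is 0, should (with probability 1 − α) fall inside the acceptance region.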
Another test whose results are as a rule reported in all standard outputs of a regression package is the F test. As in the case of the T test, there exist many different hypotheses which can be tested using an F test, but the standard hypothesis tested by the universally reported F test is:
H0: all the betas corresponding to non constant regressors are jointly equal to 0
H1: at least one of the above mentioned betas is not 0
The idea is that, if the null is accepted (i.e. big P-value), no “regression” exists (provided we did not make an error of the second kind, obviously). Hence the popularity of the test.
9.8 “Forecasts”
Here we intend the term “forecast” in a very restricted meaning.
Suppose you estimated β using a sample of Y and X, let us say n observations (rows).
Now suppose a new set of m rows of X is given to you and you are asked to assess what the corresponding Y could be.
Stated like this, the question does not allow for an answer. We need to assume some connection between the old rows and the new rows of data. A possibility is as follows.
Let the model for the n rows of data used to estimate β with β̂OLS be

Y = Xβ + ε

Suppose we now have data for m more rows for the variables in X and call these Xf. Let the model for the corresponding new “potential” observations be

Yf = Xf β + εf

And suppose (we consider here the general case where X can be stochastic) that E(ε|X, Xf) = 0 = E(εf|X, Xf), V(ε|X, Xf) = σ²εI_n, V(εf|X, Xf) = σ²εI_m and E(εεf'|X, Xf) = 0.
(Notice the double conditioning on both X and Xf.)
In this case the obvious (BLUE) estimate for E(Yf|X, Xf) is Ŷf = Xf β̂, with expected value Xf β and variance covariance matrix σ²εXf(X'X)⁻¹Xf'.
If we define the “point forecast error” as Yf − Ŷf, its expected value shall be 0 and its variance covariance matrix σ²ε(I_m + Xf(X'X)⁻¹Xf').
Be very careful not to mistake these formulas for the corresponding ones for Ŷ.
On the basis of these formulas, and working either under the strong OLS hypotheses or using Chebyshev, it is possible to derive (exact or approximate) confidence intervals for the estimate of the expected value of each element in the new set of observations and for the corresponding point forecast errors.
For instance, under the Gaussian hypothesis, the two tails confidence (α%) interval for the expected value of a single observation in the forecasting sample, under the hypothesis of a known error variance, corresponding to a row of values of Xf equal to xf, is given by:

[xf β̂OLS ± z_{1−α/2} σε √(xf(X'X)⁻¹xf')]
The corresponding confidence interval for the point estimate, that is, the forecast interval which takes into account the point estimate error, is:

[xf β̂OLS ± z_{1−α/2} σε √(1 + xf(X'X)⁻¹xf')]
In case the error variance is not known, and is estimated with the unbiased estimate

σ̂²ε = ε̂'ε̂ / (n − k)

as described above, the only changes to be made in the formulas are: σε substituted with its estimate σ̂ε = √σ̂²ε, and z_{1−α/2} substituted with t_{n−k,1−α/2}, that is: the (1 − α/2) quantile of a T distribution having as degrees of freedom parameter the difference between n and k, the number of rows and columns of X, the regressors matrix in the estimation sample.
It is easy to see that the second interval shall always be bigger than the first, as it takes into account not only the sampling uncertainty in estimating β but also the uncertainty added by εf.
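The two intervals can be computed as below (a sketch with simulated data; known σε is assumed for simplicity, and xf is an arbitrary new row of regressors):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 0.5
Y = X @ np.array([2.0, 1.0]) + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

x_f = np.array([1.0, 0.3])     # one new row of regressors
z = 1.96                       # standard Gaussian quantile for alpha = 5%
h = x_f @ XtX_inv @ x_f        # the scalar x_f (X'X)^-1 x_f'

center = x_f @ beta_hat
half_mean = z * sigma * np.sqrt(h)       # interval for E(Y_f | x_f)
half_fcst = z * sigma * np.sqrt(1 + h)   # interval for the point forecast

print(half_fcst > half_mean)  # True: the forecast interval is always wider
```

The extra 1 under the square root is the variance contribution of εf, which never vanishes, however large the estimation sample.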
9.9 A note on P-values
The standard procedure for a test is:
• Choose H0 and H1 (they should be exclusive and exhaustive of the parameter space).
• Choose a size of the error of the first kind: α. Be careful: too small an α usually implies a big error of the second kind. Your choice should be based on a careful analysis of the costs implied by the two kinds of errors.
• Find a critical region for which the maximum size of the error of the first kind is α and, possibly, with a sensible size of the error of the second kind.
• Reject H0 if your sample falls in the critical region and accept it otherwise.
This procedure typically requires the availability of statistical tables.
When computer programs for performing tests came to the fore, two alternative paths were possible.
Let the user input the α for each test and, as output, state whether the null hypothesis is accepted or rejected on the observed dataset with that α.
Or let the user input nothing, but give the user as output the value of α for which the observed data would have been exactly on the brink of the rejection region. This value is called the P value.
With this information the user, knowledgeable about his or her preferred α, would be able to state whether the null hypothesis is accepted or rejected by simply comparing the preferred α with the value given by the program. If the researcher's α is smaller than the value given by the program, the observed data is inside the acceptance region as it would have been computed by the researcher, so that H0 is accepted. If the researcher's α is bigger than the value given by the program, the observed data is outside that acceptance region, so that H0 is rejected.
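The comparison rule reduces to a single line: reject H0 iff the researcher's α exceeds the reported P value. A trivial sketch (the function name is ours):

```python
def decide(p_value: float, alpha: float) -> str:
    """Reject H0 iff the chosen alpha is larger than the reported P value."""
    return "reject H0" if alpha > p_value else "accept H0"

print(decide(p_value=0.03, alpha=0.05))  # reject H0
print(decide(p_value=0.03, alpha=0.01))  # accept H0
```

Note that α must be chosen before looking at the P value; picking α after seeing the P value is one of the abuses discussed next.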
Conceived for the simple reason of avoiding the use of tables, the P value has become, in the hands of Statistics illiterates, the source of numberless misunderstandings and sometimes amusing bad behaviors.
The typical example is the use of terms like “highly significant” for null hypotheses rejected with small P values, or the use of “stars” for indicating pictorially the “significance” of an hypothesis. A strange attitude is that of considering optimal, a posteriori, a small P value which, a priori, would never be considered a proper α value.
Sometimes, worst of all, the P value is taken as an estimate of the probability of the null hypothesis being true given the data. A magnificent error, since in testing theory we only compute the probability of falling in the rejection region given the hypothesis, and not the probability of the hypothesis given that the data falls in the rejection region.
Please, avoid this and other trivial errors: the fact that such errors are widespread,
even in the scientific community, does not make them less wrong.
Exercise: under the hypotheses used for analyzing the T test, find confidence intervals for single linear functions of β.
9.10 Stochastic X
In applications of the linear model to Economics and Finance we can only infrequently accept the hypothesis of a non random X matrix. Typically X shall be just as random (that is: unknown before observation and variable between samples) as Y. Just think of the CAPM “beta” regression, where the “dependent” variable is the excess return of a stock and the “independent” variable is the contemporaneous excess return of a market index which contains the same stock.
If X is random, the results we gave above about β̂OLS are, in general, no longer valid.
Under the hypothesis of a stochastic matrix X we can follow many ways of extending the OLS results. Each of these ways means adding an hypothesis to the standard weak OLS setting. Here I choose a very simple way which shall be enough for our purposes.
We shall extend the weak OLS hypotheses in this way:

E(ε|X) = 0
V(ε|X) = E(εε'|X) = σ²εI_n
In other words we shall assume that, conditionally on X, what we assumed unconditionally in the weak OLS setting is true. It is clear that our new hypotheses imply the old ones, not vice versa. Notice that the equality between the conditional covariance matrix and the conditional second moments matrix holds only because we assumed that the conditional expectation of ε is zero. (See by yourself what happens otherwise.)
An immediate result is that:

E(Y|X) = E(Xβ + ε|X) = Xβ

and this property justifies the name “linear regression” for our model. In other words, with a stochastic X and our added hypothesis, our model becomes a linear model for a conditional expectation: a regression function.
Let us now see if the OLS estimate is still unbiased in the new setting.

E(β̂OLS) = E(β + (X'X)⁻¹X'ε) = β + E_X(E((X'X)⁻¹X'ε|X)) = β + E_X((X'X)⁻¹X'E(ε|X)) = β
Notice the use of the typical trick: the iterated expectation rule. Now compute the variance covariance matrix.

V(β̂OLS) = V(β + (X'X)⁻¹X'ε) = V((X'X)⁻¹X'ε) =
= E((X'X)⁻¹X'εε'X(X'X)⁻¹) − E((X'X)⁻¹X'ε)E(ε'X(X'X)⁻¹)

But the second term in the sum was just shown to be equal to 0, so:

= E((X'X)⁻¹X'εε'X(X'X)⁻¹) = E_X((X'X)⁻¹X'E(εε'|X)X(X'X)⁻¹)

Now the term E(εε'|X) is, by hypothesis, equal to σ²εI_n, so that:

E_X((X'X)⁻¹X'E(εε'|X)X(X'X)⁻¹) = σ²εE_X((X'X)⁻¹X'X(X'X)⁻¹) = σ²εE_X((X'X)⁻¹)
In short: with a stochastic X and the new OLS hypotheses, β̂OLS is still unbiased, but now its covariance matrix is not fully known, as it depends on the expected value of (X'X)⁻¹.
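The iterated-expectation argument can be illustrated by simulation: redrawing both X and ε in each sample, the average of β̂OLS stays close to β (a sketch; sizes and parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 60, 2000
beta = np.array([1.0, -0.5])
est = np.empty((reps, 2))
for i in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # X redrawn each sample
    Y = X @ beta + rng.normal(size=n)                      # E(eps|X) = 0
    est[i] = np.linalg.lstsq(X, Y, rcond=None)[0]

print(est.mean(axis=0))  # close to [1.0, -0.5]
```

Unbiasedness survives because, conditionally on each realized X, the estimate is centered at β; the unconditional mean then follows by averaging over X.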
We conclude this section with just a hint at two results of standard non stochastic X OLS theory, one of which can, and the other of which cannot (in general), be proved to hold in the new setting.
First: with a suitable modification of the definition of “best”, a Gauss-Markoff theorem still holds. Let us see how this works.
In the case of stochastic X, the proof of the theorem (using the OLS hypotheses for the case of stochastic X) goes, now conditional on X, exactly as before. The last statement then becomes

V(β̂|X) = σ²ε((X'X)⁻¹ + CC')

We must then find V(β̂). In the general case, this is NOT E(V(β̂|X)), as a second term V(E(β̂|X)) should be considered. However, since E(β̂|X) = β (due to unbiasedness), this term is always equal to 0.
We then have

V(β̂) = E(V(β̂|X)) = σ²ε(E((X'X)⁻¹) + E(CC'))
This is the varcov matrix of the OLS estimate when X is stochastic plus the expected value of an at least psd matrix. If the second term is at least psd too, we have our proof. This is easy: we must show that for any non stochastic vector z we have z'E(CC')z ≥ 0. But, by the basic properties of the expected value operator, z'E(CC')z = E(z'CC'z). We already know that CC' is at least psd, which is equivalent to saying that, whatever z, we have z'CC'z ≥ 0. The expected value of a non negative number cannot be negative (why?) and we have our proof.
Second: the basic results underlying hypothesis testing do not directly hold in the new setting, except if we only work conditionally on X (which somewhat defeats the purpose of the new setting) and under the (strong) hypothesis that, conditional on X, the distribution of ε is Gaussian. However, for all those tests whose distribution is independent of X when X is not stochastic, as, for example, the T test, the assumption of conditional Gaussianity of ε given X implies that the standard result of the non stochastic X case still holds.
More in general, hypothesis testing and the construction of confidence intervals can only be done conditionally on X, or by relying on asymptotic results whose applicability in specific settings is always very difficult to assess. For instance, the standard confidence interval for a generic βj contains in its specification {(X'X)⁻¹}jj, which is stochastic if X is stochastic, so that it shall be useful only conditionally on X, for the simple reason that if you do not know X you do not know the extremes of the interval itself (notice, however, that, since the probability of being in the interval does not depend on X, this probability shall still be 1 − α for ANY realization of X; so the problem, if you work unconditionally on X, is not that you do not know the probability of being inside the interval but that you do not know the extremes of the interval itself).
9.11 Markowitz and the linear model (this section is not required for the exam)
There exists a nice connection between the OLS estimates and Markowitz optimal
weights.
Let us recall Markowitz's formula, where µR is the row vector of expected returns and µER the corresponding vector of expected excess returns:

w = λΣ⁻¹(µR' − rf 1) = λΣ⁻¹µER'

We are interested in deriving, using the OLS algorithm, a vector proportional to w or, at least, to the estimated w, where in place of the unknown parameters we have the corresponding estimates. We can then divide each element of this vector by the sum of the elements and obtain weights summing to 1.
Let us call R the n rows, k columns matrix of data on excess returns and µ = 1'R/n the k columns row vector containing the corresponding mean excess returns.
Now let us start with the OLS normal equations:

R'Rβ̂OLS = R'Y

R'R is not the estimate of Σ; however:

nΣ̂ = R'R − nµ'µ
Let us go back to the OLS normal equations and subtract nµ'µβ̂OLS from both sides:

(R'R − nµ'µ)β̂OLS = R'Y − nµ'µβ̂OLS

That is:

nΣ̂β̂OLS = R'Y − nµ'µβ̂OLS

Now suppose Y = 1n:

nΣ̂β̂OLS = R'1n − nµ'µβ̂OLS = nµ' − nµ'µβ̂OLS = nµ'(1 − µβ̂OLS)

where the 1 in (1 − µβ̂OLS) is a scalar. It is then clear that these normal equations are a scalar multiple of:

Σ̂β̂OLS = µ'
Hence, the solution of these equations shall be a scalar multiple both of the solution of the normal equations for OLS with regressors R and dependent variable 1n, and of the Markowitz weights equation. Since both are scalar multiples of the same vector, once normalized by dividing each term by the sum of its components we shall get the same normalized vector; in other terms:

β̂OLS / (1'β̂OLS) = w / (1'w)
We have just given a proof of the fact that the Markowitz weights are proportional
to the OLS β estimate for the model:
1n = Rβ + That is: Markowitz weights are (proportional to) the OLS solution of the problem:
“find the linear combination of the columns of R which best approximate a constant”
(1, in our case, but any non zero constant will do as well). This should not be surprising
as Markowitz weights minimize the variance of the risky portfolio.
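As a numerical sanity check of this proportionality, here is a small sketch in NumPy with simulated excess returns (the sample size, number of assets and return distribution below are arbitrary choices, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 4
# hypothetical sample of excess returns: n observations on k assets
R = rng.normal(0.01, 0.05, size=(n, k))

# OLS of the constant 1_n on R (no intercept): solves R'R b = R'1_n
beta = np.linalg.solve(R.T @ R, R.T @ np.ones(n))

# sample moments: mu = 1'R/n and Sigma-hat = R'R/n - mu'mu
mu = R.mean(axis=0)
Sigma = R.T @ R / n - np.outer(mu, mu)

# Markowitz-style direction: Sigma-hat^{-1} mu'
w = np.linalg.solve(Sigma, mu)

# after normalization the two vectors coincide (up to rounding)
assert np.allclose(beta / beta.sum(), w / w.sum())
```

Note that Σ̂ is computed here exactly as R'R/n − µ'µ, so the match is an algebraic identity of the two linear systems, not an asymptotic approximation.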
9.12 Some results useful for the interpretation of estimated coefficients
9.12.1 Introduction
The proper use of regression, as of any statistical tool, requires a correct understanding both of its mathematical foundations and of its meaning. These two aspects are
connected but they are not equivalent.
A rather long experience in teaching this topic has convinced the author of these
handouts that the “meaning” part, not the mathematical
foundations part, is the difficult one.
This section tries to deal, at least partially, with this point.
The section is by far longer than all the other sections dedicated to linear regression
put together. This is a clear hint that our objective is not trivial.
A word of warning is needed at the beginning of this (difficult) path: we shall try
to shed some light on one important use of regression and, due to the limits of an
introductory course, we shall only hint to another, maybe even more important, use.
With some approximation we can say that a regression model is usually applied to
two different problems: forecasting some variable on the basis of info on other variables
and forecasting the “effect” on some variable of the manipulation of other variables.
The term “regression”, as used in Probability and Statistics, has mostly to do with
the first problem. The solution of the second problem, as we shall see, requires many
assumptions which are extraneous to the sole fields of Probability and Statistics and
deal with the core of the specific subject, in our case Economics and Finance, in which
Probability and Statistics statements are made. It is very important never to mix up
the two purposes.
In what follows we shall dedicate a quite complete if simple analysis to the “forecast”
use. We shall not enter into detail about, but only hint at, the “effect of intervention” use which
is, obviously, of the utmost relevance in Economics and Finance.
Our choice is not due to a greater relevance of the first use wrt the second, but to the
fact that the “effect of intervention” use would require, for a thorough exposition, a
course in itself40.
Let us try and give some simple if approximate characterization of these two uses
of regression41 .
First use: forecasting.
Given some information, that is: observed values of variables, we would like to
forecast in a sensible way the value of other variables. This is obviously very useful for
our decision making: to have an idea about future weather on the basis of observed
meteorological data could be of some help in deciding where to pass our holidays;
to have an idea about the possible returns in a population of stocks, on the basis
40 This distinction and the debate between the two problems is at the core of the birth of Econometrics between 1930 and 1950. While many authors discussed the topic at the time, it is possible to
single out the central role of the 1943 paper by Trygve Haavelmo: “The Statistical Implications of a
System of Simultaneous Equations”, Econometrica, 11, 1, (Jan. 1943), pp. 1-12. This paper contains,
in a maybe sometimes very summarized but quite clear version, most of the arguments that would be
discussed during the following 75 years and of which a simple version is presented in these pages.
A study of this paper, while not required for the course, would be very useful for any Student wishing
for a deeper understanding of the role of statistical inference in Economics and Finance. Another
important reference in the same time frame is the Cowles Commission series of monographs and, in
particular, the 10th of these: “Statistical Inference in Dynamic Economic Models”, Wiley, (1950).
41 On these topics, for a view substantially similar to the one introduced here, see, e.g., David Freedman
(1997) “From association to causation via regression”, Advances in Applied Mathematics, 18, 59-110.
In this paper Freedman also considers a use of regression only hinted at in what follows: regression as
a summary of data.
of observed characteristics of these, could be of some help in deciding our portfolio
composition; to have some idea about the probability of default of some company on
the basis, say, of its balance sheet, could be of some help in deciding whether to invest in
it. In each of these cases we may use observations of available data in order to compute
forecasts (maybe expected values) of data we still do not know and take decisions.
In all these examples, we want to forecast events which, at the moment, are unknown
to us, on the basis of available information. Note: “forecasting” does not necessarily
concern the future: it concerns, more in general, using available info in order to infer
about still unavailable info.
Suppose you must decide whether a suspect committed a crime: you base your
evaluation on the available info (clues) today, while the event is in the past. In assessing
the origin of a medical condition, we shall base our analysis on available symptoms.
Many relevant statistical aggregates, e.g. GNP, employment, industrial production and
so on, require years to be measured in a reliable way. Routinely, national statistical
services produce estimates of these aggregates which are, actually, “forecasts” based on
partial and incomplete information on available economic indicators.
All these cases have two aspects in common: you have some information and use
this to forecast “missing information”. You do this because, typically, you must act on
the basis of this information.
Notice that a common characteristic of the above examples is that your act is not
likely to affect in any way the phenomenon you are analyzing: you get information and
use it but do not in any sense “act” in a way that could change the available information
or in any way affect the phenomenon you are studying.
The first point is clear cut, the second a little bit more delicate. While it is clear
that your decision about where to pass your holidays is reasonably NOT going to affect
weather and your decision about the possible cause of a medical situation is NOT going
to change the situation itself, your choice of investment, for instance, could have some
“effect” on the performance of a given stock or company if you are, say, a big and
authoritative investor.
We could think of two kinds of effects. The first is straightforward: if the investor
intervenes in the market on the basis of the forecast model, a sizable market action
could have effect on the price of the stock. The second is more “strategic” and is typical
of the Economics field: suppose the company knows about the forecast model and, for
any reason, intends to give a “good impression” to the investor42. This could imply
an action on the part of the company modifying some of its characteristics. This
action would not have been there had the model not been used, and could degrade
the forecasting abilities of the model by altering the distribution of the firm’s observed
characteristics wrt the one valid when the model was estimated.
42 Suppose, for instance, that asset selection models give positive weight to, say, a lower leverage.
This could induce a company to alter its leverage in a way which would be economically meaningless
had the model not been in place. Another example: risk management regulations set limits on some
variables related, for instance, to the overall exposure of a bank to some category of risky asset. These
rules were calibrated considering, also, historical data about the connection between these exposures and
risk, on the basis of a risk model. However, once the rules are in force, their simple existence could
imply a strategic response on the part of the bank, altering its risk exposure in a way that could alter
the ability of the model underlying the rules themselves to correctly forecast risk.
This line of thought, while relevant and quite interesting, would require a complex
analysis well beyond the purpose of these introductory notes. Just keep in mind that
the point is relevant43 .
Leaving this delicate point aside, just consider this first use of regression as a
“forecasting” use followed, maybe, by an act both of which (forecast and act) do not
interfere with the phenomenon under analysis.
In this setting, as we shall see, regression is a very good tool for forecasting (in a
very precise sense of “good”) and, if properly considered, its use in forecasting is, under
reasonable and very clear hypotheses, easy to apply and to understand. Moreover, and
this is very important from the practical point of view, we shall find simple and reliable
“a priori” measures of the quality we may expect the forecast to have.
Let us now go to the second use of regression: evaluating the effect of intervention/policy/treatment etc.
Sometimes we would like not only to make forecasts given information:
we would like to actually influence the variables/system we forecast by “altering” the
information.
In other words: we are still looking for forecasts but, now, we want to forecast
what could happen if we acted on some of the conditioning variables involved in the
regression in such a way as to choose their values.
Consider the following example: we have data on the returns for the stock of several
companies and data on, say, a set of balance sheet indicators for the same companies.
If we observe a correlation between returns and indicators, e.g. a negative correlation
between return and leverage, we could use indicators in order to forecast returns and,
maybe, invest in lower than average leverage companies. This is a forecast use and,
in order to work, only requires that the correlation is stable enough in time. If our
intervention in the market is not huge, it is reasonable to assume it shall not alter the
forecast itself.
42 (continued) One of the most interesting and most delicate aspects of Economics is the study of these strategic
responses and how to take account of them in devising rules and policies.
Notice that this is not something happening or considered only in the Economics field. One of the
reasons why, in medical experiments, both treated and controls are treated in a symmetric way, such
that nobody, patient or doctor, knows whether what was given is treatment or placebo, is to avoid possible
“treated/non treated dependent behaviours” which could alter the result of the experiment.
43 It should be clear, for instance, that such a line of thought involves some specific definition of
“causality” through the idea of the “effect of an intervention”. Any use and discussion of the word “cause”
implies dire problems which have plagued science and philosophy since their inception and this, surely, is
not the place to discuss these matters in any depth. Consider, however, that the common idea (for
instance in Economics and medicine) of the “measure of the effect of an intervention”: raise rates, assign
a medical treatment, has very little to do with the notion of “causality” in classical Physics, where
the word implies no intervention or action but the forecastability of a physical system’s future given its
present state.
However, we may also wish to use the info in a different way: we may be tempted
to try and influence the stock return of a given company by altering its leverage. In
doing this we are no longer just observers of a phenomenon: we intervene by “deciding”
the value of one or more variables involved in it.
Suppose we do this and “forecast the effect” of our action based on observations
of the “non tampered” phenomenon.
It is clear that our action shall have any hope of getting results in line with this
forecast only if some kind of “invariance to intervention” hypothesis, for the relations
among the variables of interest when we observe them and when we act on them,
holds AND if the consequences of our action on all the relevant variables used for the
“forecast” are well understood. This second, very delicate, point is often subsumed in
the idea that I can study the “effect” of my action on, say, a variable, as if this would
not imply changes in other relevant variables: the “ceteris paribus” idea.
It may well be that simple observational data, potentially useful for simple forecasting, shall be useless for this different purpose and that we should instead create
novel ways to find information (e.g. experiments, when possible, or other “intervention
friendly” ways).
Consider some simple instances.
Suppose we increase the leverage of our company because we observe that, in our
dataset, by regressing the stock return on a set of balance sheet indicators including
leverage, we get a positive estimate for the parameter of leverage.
What we observe, the source of our estimates, are data on companies where, presumably, leverage is set at an equilibrium value compatible with the other variables
characterizing the company. For these companies it is true (this is what we see in the
estimates) that, of two companies with the same values for the other variables, the one
with the higher leverage has the higher expected return.
Useful from the point of view of forecasting.
However, by now altering one of these variables, leverage, without consideration
for the rest, we may put the company out of equilibrium; this may alter the relationship
among the different variables and between these and the stock return. The result is
difficult to assess and, probably, would require a much deeper analysis. In any case it
could be completely different from what was expected.
Now, suppose the overall relations among the variables are not disturbed by our
action. Notwithstanding this, our action, by itself, may alter the value of other relevant
variables. The “ceteris paribus” clause would not be valid, and this is very likely in
an equilibrium setting where variables must adapt to changes in other variables to
maintain equilibrium. If this happens, what is relevant is the “net effect” of our action,
taking into account all interrelations among variables. This could be the opposite of
what was intended but, and this is the important point, to state the overall result would require an
analysis of the interrelationships among the different variables which goes well beyond
a simple regression.
Just as an example: in a regression with observed data, and with the firm’s stock returns
as dependent variable, we may find a positive parameter both for leverage and, say,
the inventory to sales ratio. However, it may be that an (off equilibrium) increase of
leverage has a negative effect on inventory to sales, and the net effect on the expected
stock return could be less positive than expected, or even negative.
A last point (which is, actually, a special case of the first). It may also be that the
observed correlation between leverage and stock return is due to the fact that both
variables are positively correlated with a third variable, observed or not. This does
not alter the usefulness of the regression as a forecasting tool, but implies that any
manipulation of the leverage shall be ineffective in changing expected returns. The
information given by the tachometer is surely useful to evaluate the speed of our car
(which we do not directly measure). However, moving the
tachometer needle (old car, analogue meter) would not change the speed of the car
and, probably, would break the tachometer, so that useful info would be lost.
In this case both the speed of the car and the tachometer needle depend on the rate
of rotation of the car’s wheels, which depends, say, on the gear and the gas, for given
conditions of the road.
This simple example is not so far from economic intuition. In equilibrium the level of
interest rates is a measure of the convenience to invest. If a country, maybe in recession,
decides to lower interest rates, maybe intervening in the market to buy its own debt, it puts
interest rates out of equilibrium. It may be that the resulting decrease of rates allows
for more investments; however, the new investments are such that they would not
have been made in the previous situation. This could be due to the scarce
expected productivity of such investments. It is then questionable whether “inducing” such
investments may improve the overall economic situation. This example is simplistic
but it may give an idea of the problems implied in an “intervention” analysis.
Many other examples are possible and we shall go back to these points in what
follows. What is to be kept in mind is that both purposes, simple forecast and “intervention effect measure”, are interesting and practically important but, while the
requirements for the forecast use of a regression to be useful are quite simple, clear cut
and, in summary, limited to the invariance of the distribution of observables, this is
not true or simple when we are interested in forecasting effects of intervention.
In what follows we shall give some basic, but rather complete, hints toward a correct
reading of regression as a forecast tool and only hint at the “intervention” use, with the
purpose of giving some criterion to distinguish between the two.
This is an introductory course and we cannot even try a summary of what is needed
for a correct “action/intervention” analysis. For this reason, our sole purpose in this
regard is to make as clear as possible that forecasting and intervention analysis are
different and partially non connected purposes. This has been a well understood fact since
the beginning of the study of Econometrics and, in fact, most of what distinguishes
Econometrics from other fields of application of statistical inference can be traced to
different attempts at dealing with these two purposes.
To understand this is of paramount importance, in order to avoid PRACTICAL
errors with DIRE PRACTICAL consequences.
The plan of this section is as follows:
We begin by a simple analysis of the basic properties of a regression function.
We follow this by trying to characterize, in some more detail, the difference
between a forecast and an intervention.
We conclude by presenting a simple procedure for analyzing a regression (and in particular a linear regression) as a forecast tool.
9.12.2 Some properties of the regression function (*only the definitions and properties are required for the exam, not the proofs of the properties. These are, however, a useful exercise)
In what follows we shall always consider linear models as models of linear regressions.
In other words, we shall suppose that a “set” of random variables (in general a matrix)
Z (maybe only partially observable) has a probability distribution P (Z), that we want
to study the conditional expected value (aka regression function) of one vector of Z,
that we indicate with Y , conditional to a matrix of other elements of Z, that we call
X (so that, obviously, we suppose such conditional expectation to exist) and that we
suppose this conditional expectation to be expressed by E(Y |X) = Xβ, where β is a
vector of non random parameters.
It should be obvious that, out of a given Z, we could compute and be interested in
many conditional expectations which we could derive from P (Z) and that, as already
stated, we could be in the condition of not having complete knowledge of the elements
of Z. From the point of view of forecasting, the choice shall be based on what we know and
what we wish to forecast in Z. Just to stress the different objective: in the case of an
intervention analysis this quite arbitrary choice would be totally unwarranted.
While it is not strictly necessary, just for simplicity, we shall suppose, when necessary for simplifying results, all “interesting” regression functions to share the property
of being linear.
Let us recall some properties of regression functions in this context (remember: in
what follows we avoid mentioning mathematical details which could be quite difficult
from the technical point of view, while almost irrelevant from the applied point of view.
Hence, our results shall be “bona fide” versions of more complex and sometimes cryptic
results).
Most of these properties are already recalled in the appendix: 19 but it is useful to
recall them here and comment on them a little more in depth.
We begin with general properties which only require the regression function to exist,
be it linear or not.
1. Suppose Y = g(X). That is: Y is a function of X. Then E(Y |X) = g(X).
2. E(E(Y |X)) = E(Y ). Here you must understand the meaning of the two
E signs. Each expected value is taken wrt a different distribution (but both
distributions derive from the same P (Z), otherwise the result does not hold).
The “outside” E is wrt the distribution of X, that is P (X); the “inner” E is wrt
the conditional distribution of Y GIVEN X. In general E(Y |X) is a function of
X and for this reason you take the outside expectation wrt the distribution of
X. This function is what is called the “regression function” (of Y on X) and in our
vector case this “function” is a “vector function” (or, better, a vector of functions).
In case the regression function is a constant (that is: it has the same value
conditional on different values of X), due to the above result this constant must
be E(Y ). In this case we say that Y is regressively independent of X. Now let
us reconsider the outside E. We said this expectation is wrt P (Z), but now we
understand that what is inside, the regression function, is a function of X only. To
take the outside expected value, then, we only need P (X), which we could derive
from P (Z) with the usual trick of “integrating out” or “summing out” (continuous
and discrete case) the unnecessary variables.

3. E(Y − E(Y |X)) = 0. This is a corollary of the second property and of the interpretation of the E. Here you see, and this is very useful for a good understanding,
that the interpretation of the outside E as wrt P (Z) is relevant. In fact, the first
obvious step in proving the result is: E(Y − E(Y |X)) = E(Y ) − E(E(Y |X)) and
then, since the second property is that E(E(Y |X)) = E(Y ), we have the required
result. This is correct, but it implies that, when we write E(Y − E(Y |X)), the
outside E is wrt P (Z): otherwise, if you could only compute E wrt P (X), it would
be impossible for you to compute E(Y ). You can forget this and use the simple
additive rule of the expected value symbol plus the first property, but if you really
want to understand what you are doing it is better you consider the point at least
once.

4. E((Y − E(Y |X))E(Y |X)') = 0 (notice the ' sign and recall we are here considering vectors and matrices). Let us do it step by step. E((Y − E(Y |X))E(Y |X)') =
E(Y E(Y |X)') − E(E(Y |X)E(Y |X)'). Up to this point nothing new. Now concentrate on the first term of the difference and use the second property “in reverse”: E(Y E(Y |X)') = E(E(Y E(Y |X)'|X)). This seems to make things more
complex, but we should recall that E(Y |X) is a function of X so that (property one) E(E(Y |X)|X) = E(Y |X). Use this and get: E(E(Y E(Y |X)'|X)) =
E(E(Y |X)E(Y |X)'), where E(Y |X)' is taken out of the inner expectation conditional on X
because it is a function of X. With this understanding we have the required
result: if you compute the covariance matrix of the “forecast errors” Y − E(Y |X)
and of the “forecasts” E(Y |X) you get 0. You should be able to understand why
we are calling E((Y − E(Y |X))E(Y |X)') a “covariance matrix” and why here
we are using the terms “forecasts” and “forecast errors”.
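These properties can be checked numerically on a small discrete joint distribution. The pmf below is an arbitrary made-up example; the computations follow the definitions word for word (scalar case, so the transposes play no role):

```python
import numpy as np

# a small, made-up joint pmf for (X, Y): rows index X values, columns Y values
P = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.05, 0.40]])
y_vals = np.array([-1.0, 0.0, 2.0])

p_x = P.sum(axis=1)            # marginal of X
p_y = P.sum(axis=0)            # marginal of Y
E_Y = p_y @ y_vals             # E(Y)
phi = (P @ y_vals) / p_x       # regression function: phi(x) = E(Y | X = x)

# property 2: E(E(Y|X)) = E(Y), outer expectation wrt P(X)
assert np.isclose(p_x @ phi, E_Y)

# property 3: E(Y - E(Y|X)) = 0, expectation taken wrt the joint distribution
err = y_vals[None, :] - phi[:, None]      # forecast error at each (x, y) pair
assert np.isclose((P * err).sum(), 0.0)

# property 4: forecast error and forecast have zero "covariance"
assert np.isclose((P * err * phi[:, None]).sum(), 0.0)
```

Any joint pmf would work here; the identities hold by construction, which is exactly the content of the properties above.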
Now we are ready to state and prove a very simple result of paramount relevance.
This result justifies the use of the regression function in order to “forecast” Y on
the basis of X.
Consider any (vector) function h(X) (the vector has the same dimension as Y ).
Suppose you want to “forecast” Y using h(X) and you want this forecast to be “the
best possible”.
The regression function is a possible candidate, as it IS a function of X (call
E(Y |X) = φ(X)), but there exist, in general, infinitely many other possibilities.
You measure the “size” of the “expected forecast error” with its mean square error matrix44:

E((Y − h(X))(Y − h(X))')

We would like this to be as “small” as possible: this seems a sensible idea. However,
this is a matrix, so that we must define “small” in a non trivial way.
Here we follow the Gauss-Markoff definition and we look for a choice h(X) =
h∗(X) such that any other choice would yield an MSE matrix whose difference from that
corresponding to h∗(X) is (at least) PSD, that is:

E((Y − h(X))(Y − h(X))') = E((Y − h∗(X))(Y − h∗(X))') + H

where H is PSD.
We are about to show that the “best” way of choosing h(X), so as to make

E((Y − h(X))(Y − h(X))')

the “smallest” in this sense, is to choose h(X) = φ(X).
Begin with the identity:

E((Y − h(X))(Y − h(X))') = E((Y − φ(X) + φ(X) − h(X))(Y − φ(X) + φ(X) − h(X))') =
= E((Y − φ(X))(Y − φ(X))') + E((φ(X) − h(X))(φ(X) − h(X))') +
+ E((Y − φ(X))(φ(X) − h(X))') + E((φ(X) − h(X))(Y − φ(X))')

We are now going to show that the last two terms of this sum are both equal to 0.
They are one the transpose of the other, hence we can just prove the result for any
one of the two.
Take for instance the first one and use the first property, additivity and the second
property; you get:

E((Y − φ(X))(φ(X) − h(X))') = E(E((Y − φ(X))(φ(X) − h(X))'|X)) =
= E(E(Y |X)(φ(X) − h(X))' − φ(X)(φ(X) − h(X))')

Now: since E(Y |X) = φ(X), inside the expected value we have the difference of
two identical matrices, that is: a matrix of zeroes.
In the end we have

E((Y − h(X))(Y − h(X))') = E((Y − φ(X))(Y − φ(X))') + E((φ(X) − h(X))(φ(X) − h(X))')

The first term of the sum does not depend on your choice of h(X).
Consider the second. If we recall the definition of positive semidefinite matrix, we
see that (φ(X) − h(X))(φ(X) − h(X))' is PSD. This implies that E((φ(X) −
h(X))(φ(X) − h(X))') is also PSD.45
The “best” you can do, then, is to set this term to 0 by choosing h(X) = φ(X).
Any other choice gives you a bigger, in the Gauss-Markoff sense, mean square error
matrix.
44 In general this is not the variance covariance matrix, as we are not requiring E(h(X)) = E(Y ).
Summary of the result: the regression function is the “best” (in this particular sense) function of X if your aim is to forecast Y .
Note: If, instead of E((Y − h(X))(Y − h(X))'), we decide to minimize E((Y −
h(X))'(Y − h(X))), that is: the expected sum of squared forecast errors, a proof
following the same steps as the above, just changing the position of the transpose
sign, shows that the regression function minimizes the expected sum of squared forecast
errors. In this case the objective function is a scalar, so the term “minimizes” has
the usual sense.
All these results do not require the regression function to be linear.
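A quick Monte Carlo sketch of this optimality result, in the scalar expected-squared-error form (the quadratic regression function and the competitor forecasts below are arbitrary illustrative choices, not part of the theory):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
y = x**2 + rng.normal(size=x.size)   # so E(Y|X) = X^2, a nonlinear regression function

def emse(h):
    """Monte Carlo estimate of E((Y - h(X))^2)."""
    return np.mean((y - h)**2)

# the regression function beats each arbitrary competitor h(X)
assert emse(x**2) < emse(x)                          # a linear h
assert emse(x**2) < emse(np.full_like(x, y.mean()))  # a constant h
assert emse(x**2) < emse(x**2 + 0.3 * x)             # a perturbed version of phi
```

Any measurable competitor h(X) would do: the inequality is exactly what the Gauss-Markoff-style argument above guarantees, and nothing in it requires the regression function to be linear.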
We need linearity in order to state and prove the following important property.
Suppose X1 is a subset of the columns of X and suppose (linearity) that E(Y |X) =
Xβ and E(X|X1) = X1 G, where I use an uppercase letter here (G) because X is a
matrix, so that E(X|X1) is a matrix of regression functions.
We then have E(Y |X1) = E(E(Y |X)|X1) = E(Xβ|X1) = X1 Gβ.
As stated above, given a choice of Y many regressions are possible if we condition Y
on different sets of X. However, there shall be a connection between these regressions.
This simple result allows us (for a linear regression) to compute the coefficients of
the regression of Y on X1 when we know the coefficients of the regression of Y on X
and X1 is a subset of X. A more general version of this result, the partial regression
theorem, shall be discussed in what follows.
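The finite-sample OLS analogue of this identity holds exactly, because the columns of X1 are orthogonal to the residuals of the long regression. A small sketch (the data generating process below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
X[:, 2] += 0.5 * X[:, 0]       # make the regressors correlated
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

X1 = X[:, :2]                  # keep only a subset of the columns of X

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # long regression: y on X
G = np.linalg.lstsq(X1, X, rcond=None)[0]     # each column of X regressed on X1
b1 = np.linalg.lstsq(X1, y, rcond=None)[0]    # short regression: y on X1

# the short-regression coefficients equal G times the long-regression ones
assert np.allclose(b1, G @ beta)
```

This is not an asymptotic statement: since X1'e = 0 for the long-regression residuals e, the equality holds to machine precision in any sample.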
45 This is because the expected value operator has the “internality property”, that is min(z) ≤ E(z) ≤
max(z) for any random variable z. Since the matrix (φ(X) − h(X))(φ(X) − h(X))' is of the kind
P P', it is a PSD matrix (whatever the set of values X takes). This means that, whatever the
vector of numbers v, we have v'(φ(X) − h(X))(φ(X) − h(X))'v ≥ 0. So, by the internality property,
E(v'(φ(X) − h(X))(φ(X) − h(X))'v) ≥ 0 whatever be v; but v, while arbitrary, is non stochastic,
hence we can take it out of the expected value operator and we get v'E((φ(X) − h(X))(φ(X) −
h(X))')v ≥ 0 whatever be v. This is what we wanted to prove. Notice that by saying PSD we are not
excluding that the matrix is PD; the point is that we only need to prove it is at least PSD.
9.12.3 Forecasting versus intervention
We now know something more about the regression function. In particular we know
that regression is, in an interesting sense, an “optimal” choice if we want to forecast Y
when we know X.
It is, therefore, not surprising that, when we are concerned with such a forecasting
purpose, we liberally use it.
To forecast, in this context, means simply this: suppose you observe the variables
X on some subject/individual; you are interested in forecasting the value of another
variable Y, and this interest depends, as a rule, on the fact that the value of Y is
unknown. If you want the forecast to minimize (in the sense considered above) a
statistical measure of the error such as E((Y − h(X))(Y − h(X))'), you are going to use as
forecast E(Y |X); that is, if you know the functional form of E(Y |X), you simply input
X in it and get as output the conditional expectation/forecast.
You do not act in any way to choose, alter or set the value of X, and any decision
you take, as a consequence of your forecast, is supposed not to alter the regression
function.
It is to be stressed that it is not necessarily the case that Y follows X in time. A
forecast is useful any time X IS KNOWN BEFORE (not: “happens before”) Y. For
instance: X could be some statistical series easy to measure or observe, while Y may
be much more difficult and costly to observe, even if contemporaneous to X or even
“happened” before X. In this case the use of E(Y |X) as a forecast of Y given the info
on X is very useful in order to save time and money.
For instance: the GDP variable in national accounts is, as a rule, first computed
using a set of proxies, easy to observe, and then updated and made more precise over
time (years) as more detailed info becomes available.
In order for X to be useful to forecast Y , what is required is that E(Y |X) be a non
constant function of X.46 While it could be of interest to understand why this may be
the case, it is important to understand that, at a first order of approximation, if all we
are interested in are forecasts, it is not so important to answer this question: we can
just use this fact.
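The variance decomposition V(Y) = E(V(Y|X)) + V(E(Y|X)) behind this requirement (see footnote 46) can be verified on a small made-up discrete distribution (the pmf below is an arbitrary example):

```python
import numpy as np

# a small, made-up joint pmf: rows are X values, columns are Y values
P = np.array([[0.2, 0.1],
              [0.3, 0.4]])
y_vals = np.array([0.0, 1.0])

p_x = P.sum(axis=1)
p_y = P.sum(axis=0)

E_Y = p_y @ y_vals
V_Y = p_y @ y_vals**2 - E_Y**2

phi = (P @ y_vals) / p_x                      # E(Y | X = x)
V_Y_given_x = (P @ y_vals**2) / p_x - phi**2  # V(Y | X = x)

lhs = V_Y
rhs = p_x @ V_Y_given_x + p_x @ (phi - E_Y)**2   # E(V(Y|X)) + V(E(Y|X))
assert np.isclose(lhs, rhs)
```

Since the second term on the right is V(E(Y|X)), knowledge of X can reduce the residual uncertainty E(V(Y|X)) below V(Y) only if the regression function is non constant, which is exactly the point made in the text.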
A forecast is only about information, it is about using what we know in order to
say something about what we do not know.
Sometimes, but this is just a particular case, the “information linkage” between
variables may have something to do with some causal (in any intuitive sense) connection
between variables: “if I know the cause i should know something about the effect”.
However, it is also true that if I know the “effect” I can say something about
the “cause” (just think about how an MD or a police detective works, deriving from
symptoms/clues hypotheses on the medical condition/culprit).
46 “Useful” here means V (Y ) > E(V (Y |X)), that is: the knowledge of X “reduces the uncertainty” on Y.
Recall the identity V (Y ) = E(V (Y |X)) + V (E(Y |X)); since both rhs terms are non negative, in order
to have V (Y ) > E(V (Y |X)) it is necessary that V (E(Y |X)) > 0, that is: E(Y |X) must be a non constant
function of X.
In conditional expectations terms, forecasting has nothing to do with the “direction”
of such possible, but not necessary, causal connection: we can try and forecast the
“effect” given the “cause” or the “cause” given the “effect”. The point is: what do we
know, and what do we want to forecast.
Let us go back to the example concerning the speed of a car. Let us say that the
“true speed” of your car is that measured by a roadside Doppler radar, while what we
can observe is the car’s tachometer. If we suppose that both tools are well calibrated,
and the conditions of the road reasonable, we expect both tools to give similar values for
“the speed of the car”. You can then use any of the two in order to “forecast” the other,
and the forecast should be quite good (meaning: high R2). You choose which forecast
to use on the basis of available info. As the driver of the car, you may be interested
in the forecast you can make using your tachometer in order to avoid breaking speed
limits.
It is also clear that there is no “causal” connection between the two measures, at
least not in the sense that, by altering the reading of one of the two instruments, you
can alter the other. If, for instance, you break the plastic on the instrument panel of
the car and stop the tachometer arrow with your finger (we suppose analogue dials), this
is NOT going to alter the reading of the radar measurement, and vice versa.
In this case you have very good forecasts, provided you do not mess with the
instruments. You can make forecasts conditioning both ways, according to what you
know. But such forecasts do not imply any causal connection, at least in the sense that
you could alter one measure controlling the other.
Since we know what is happening (an economist would say: “we know the structure
of the economy”), we know the reason for this. The two dials measure correlated phenomena. The radar measures the speed of the car wrt the radar itself, the tachometer
measures the rolling rate of the tire. If the car is running on a reasonably non skidding
surface (not on ice) the two should be highly correlated, hence our forecast ability.
It is interesting to notice that, if we are only interested in forecasting, we may do
without such understanding and only suppose the informative relationship is stable. In
principle, under stability, we may forecast even if we do not “understand”, in the sense
that “we have no idea whence the correlation comes”. This informal idea of “stability”
has several names in Statistics. In a very simple and constrained sense, it is called
stationarity. The idea of i.i.d. random variables is a particular case of this. More
generally, the relevant idea is that of “ergodicity”, which is quite beyond what we do in
this course.
We can go further: it is clear, in an intuitive sense, that the rolling rate of the tire
“causes” the tachometer measure (even on skidding surfaces) and the Doppler radar
measure (non skidding surfaces) in the sense that if I alter the rate, maybe acting on
the gas pedal or on the brake pedal, I expect the dial of the tachometer and of the
radar to move in a precise direction. It is also clear that this is not true in reverse: I
cannot speed up by moving, with my finger or other tools, any of the dials.
This notwithstanding, we are using the “effect” (the position of the dial) in order
to “forecast” the “value” of the “cause” (the rolling speed of the tire). This is perfectly
sensible and is going to work, obviously, if I do not tamper with the dial.
As mentioned in the introduction of this section, to use the “effect” in order to “forecast”
the “cause” is a quite common procedure. Consider a case where the “cause” is in the past
while the “effect” (as usual) comes after it.
The information we have about extinct living beings comes from their fossilized
remains which are available today.
We can say something about the shape and behaviour of extinct living beings by
“conditioning” on the information we can derive from what today is a fossil. However,
in no sensible meaning do fossils “cause” the existence in the past of now extinct living
beings. I could destroy all fossils today; this would be very foolish and would not alter
the past of life on our planet. Maybe it would alter our understanding of this past,
and this could be the objective of the (reasonably mad) fundamentalist/paleo-terrorist
involved in the destruction. This, however, is another story.
Once we understand that a forecast has only to do with an information linkage
between variables, behind which there may or may not be a causal relationship, we may
consider a second point and try to shed more light on the difference between simple
forecasting and an attempt to intervene on the result of a phenomenon.
When you compute E(Y|X) for a given subject for which you observe X, a set of
variables, you simply put the observed X in the function φ(X) = E(Y|X) and get
your forecast. If X is made of many different measures, there is not much interest in
measuring the “contribution to the forecast” of each variable in X.
To make things simple, suppose Y is a single variable and X is made of X1 and
X2 , suppose E(Y |X1 , X2 ) = α + β1 X1 + β2 X2 where, for instance α = 0, β1 = 1 and
β2 = −1.
This simply means that, if in a subject you observe X1 = .4 and X2 = 1 your
forecast for Y in that unit is -.6, and if you observe, in another unit X1 = .4 and
X2 = 2 your forecast is -1.6.
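The arithmetic of this small example can be sketched in a few lines (the coefficients are exactly those of the example above):

```python
# The forecast function of the example: E(Y | X1, X2) = alpha + beta1*X1 + beta2*X2
alpha, beta1, beta2 = 0.0, 1.0, -1.0

def forecast(x1, x2):
    return alpha + beta1 * x1 + beta2 * x2

print(forecast(0.4, 1.0))  # -0.6  (first unit)
print(forecast(0.4, 2.0))  # -1.6  (second unit)

# Two DIFFERENT units, same X1, values of X2 differing by 1:
# the forecasts differ by beta2 = -1.
print(forecast(0.4, 2.0) - forecast(0.4, 1.0))  # -1.0
```

Note that the last line compares two distinct units; as the text goes on to explain, it says nothing about what would happen if we “changed” X2 within one unit.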
You can surely say that, if you consider two different subjects with the same, say,
X1 and two different values of X2, where the difference in the two values is 1, the two
forecasts shall differ by β2, that is by -1. But this cannot be read in the sense that, if
in a unit you “change” the value of X2, increasing it by 1, then you are going to get a
forecast smaller by 1 than before.
This is wrong for many reasons. Consider the tachometer example above: let us say
that to change X2 means to alter the position of the dial with your finger. It would be
foolish, in this case, to use as forecast function the conditional expectation computed
by observing data where the dial is not tampered with. If I actually did the experiment
of comparing the speed as measured by the radar with that of the non tampered and
tampered tachometer, I would see that, while a change in the untampered tachometer
dial corresponds (with some approximation) to a change in speed as measured by the
radar, this does not happen if the dial position is altered because it was tampered with.
This is totally obvious but contains a very important teaching: there are ways in
which, if I “change” some variable in the system I observe, the informative role of such
altered variable changes wrt the role it has in the untampered system.
For this reason, by itself, the observation of the untampered system, while useful
for forecasting, cannot tell me anything, in principle, about the “effect” of me tampering
with the system.
If our notion of “cause” is based, as it usually is in Economics and Finance, on the
idea of “intervention”, we can simply say that a forecast made on the basis of information
(be it made using regression functions or other tools) in general tells us nothing about
any causal relationship.
This is even more evident when, as in the case of fossils, we observe X a long time
after Y did happen. The observation of a fossil fish may be the indication that, in the
past, a sea existed where the fossil was found, and the observation of a fossil sloth would
imply that, in the past, some forest/savanna environment existed where the fossil is
found; but if I swap the fossils I cannot expect to alter the past environments.
Again, this is obvious and, being obvious, it should always be in the mind of any
researcher using regression.
In this section we did see that a regression is always (under some stability hypothesis)
useful for “forecasting” and that “forecasting” does not necessarily have to do with “cause”
but only with information.
What we are going to discuss in the following sections is a simple but complete
way to read regressions as forecasts.
If, before doing this, you wish for some more hints about the business of intervention,
read the following section (not required for the exam).
9.12.4 Some quick points about intervention (not for the exam)
While we are not considering, here, measures of causal effects, it could be useful to
dedicate some lines to this topic, in order to at least understand how much more difficult
this kind of job is, even in the case we limit ourselves to a definition of “cause” based on
the idea of an “intervention” setting an arbitrary value to a variable in a system.
This idea of “measuring the effect of an intervention”, with its implied optionality,
owes much to fields like medicine, agricultural and biological experiments, based
on statistical experiments and randomization.
Notice that this has, for instance, nothing to do with the concept of “causal” in
classical 19th century Physics. In that setting “causality” had to do with the fact
that, in principle (at least this was the belief at the time), once the initial conditions
of a system of particles were known (positions and momenta), the future evolution
of the system was fully determined by these and the equations of motion. In this
setting an “intervention” is simply the perfectly determined result of an inevitable
interaction between two perfectly determined systems (in other words: there is no
possible “choice”).
Let us reprise, in a more detailed way, the stylized example regarding interest rates.
It could be, and it is, of interest to study the connection between the level of interest
rates and, say, economic growth.
However, we are also very much interested in the “effects” of monetary “policy”. We
are interested, for instance, in the possibility of controlling growth by “manipulating” rates.
Are these two purposes (forecast and study of the effects of policy intervention)
equivalent?
This is by no means evident.
Suppose the economy is left to “work by itself” and we do not intervene. Interest
rates and growth are jointly determined, and neoclassical Economics supposes this
connection to satisfy some equilibrium condition (various exist) involving many other
observable and non observable quantities: in principle, all those quantities that are
involved in individual choices, which, in their turn, are modelled as those of optimizing
individuals possessing some utility ranking for the different possible results and aiming
at the best possible result according to this ranking. The way such an equilibrium is
reached is, as a rule, not discussed, but an axiom of the theory is that “nothing happens
if not in equilibrium”. The evolution we observe in the economic system is a continuous
change from one equilibrium position to another equilibrium position, induced by external
factors (resources, technology) and, possibly, the evolution of preferences.
This implies that, when we observe changes in variables, these changes always
satisfy the equilibrium axiom.
In our example, we may understand this “equilibrium” as described by the joint
probability distribution of interest rates and growth (and, arguably, many more variables).
We may suppose this distribution to be known. Or maybe, under some condition,
we may estimate this from data.
From our knowledge, or from observation we may, for instance, infer a negative
correlation between growth and interest rates.
The meaning of this, if we intend just a marginal correlation with no further conditioning,
is that we did, or should, observe that, on average, while the economic system
transits between different equilibria, higher rates are accompanied by lower growth and
vice versa.
This is equivalent to saying that we can forecast higher growth when we know that
interest rates are lower, and higher rates when we know growth is lower.
Which of the two forecasts (rates with growth, growth with rates) we choose to
make depends on our purposes and on what we know. In any case the forecasts shall
“work” if the correlation is stable and “work well” if it is high.
Does this imply that I can, say, increase growth by setting low interest rates? In
general, no; or, better, we can say nothing on this point on the basis of the sole knowledge
of observed correlations.
The question itself is puzzling: if everything is determined in equilibrium, how can
I choose the value of any variable? Am I, in some sense, outside the system, as, say, is
Nature (who chooses, say, crop yield and so determines an evolution of the economy
through different equilibria), or am I in the system but, in some sense, able to set some
variable to a value that is not the equilibrium one and expect the system to reach a
new equilibrium with this added constraint?
(There is and was much discussion about this point).
Whatever the interpretation of the question, the reason why an answer is impossible
on the sole basis of the observed (equilibrium) correlations is quite obvious.
The info I used to make forecasts came from the free working of the “economy”.
If I intervene and change rates, whatever the interpretation of my intervention, I am
tampering with this autonomous working and it may be that I substantially alter it.
In principle I should change the very name of the “interest rate” variable when this
is set arbitrarily by me. It used to be the “free equilibrium interest rate”; now it is the
“intervention interest rate”.
This change of name, by itself, could be useful in that it clarifies what is non
obvious in the use of, say, the regression function for growth conditional on the free
equilibrium interest rate in order to evaluate the “effect” of setting a particular value of the
“intervention interest rate”.
Are we doing something like pushing the gas pedal to increase speed, or are we
trying to increase the speed by acting on the tachometer dial? Or maybe are we even
changing the structure of the car as a system and maybe altering it to a point where
it breaks down?
It is important to understand that we cannot decide this on the basis of the
simple observation of the “natural” correlation of interest rates and growth, where no
“action” on interest rates is under way.
This “natural correlation” is very useful if we try to forecast one of the two variables
using the other under “natural conditions”. It could be completely useless both for
causal effect measurement AND for forecasting if we intervene on one variable.
It may even be the case that we simply cannot intervene, even in a perfect causal
setting: there is correlation between temperature and the position of the sun wrt the
horizon but we cannot alter this position in order to have an effect on temperature.
There is correlation between age and income; this allows us, e.g., to forecast income
for any given age and, as a consequence, to estimate the possible path of income for a
given individual as time passes. But we cannot alter or intervene on age. And so
on.
Let us go back to “natural correlation”. A way to understand this is as follows.
Suppose Z is the set of “relevant” economic variables (observable or not). Let P (Z) be
a valid probabilistic description of the “economy left to itself”. Any intervention could
alter it in the sense that the probabilistic description of Z when one or more variables
are acted upon could be a different joint probability, say, A(Z).
The difference could be of many kinds, some less and some more relevant.
One very relevant difference would be the case where the regression function under
intervention is different from the “forecasting” regression function, so that observations
coming from “the world of P(Z)” could be of very little use to forecast some Y
when we intervene on some X. Notice that this could also be true in reverse: if your
data comes from intervention, it may be that you cannot use it for estimating the
forecasting regression function.
There is another important problem. Suppose that P(Z) is not altered by the
intervention or, at least, the intervention does not change the regression function, that
is E(Y|X_intervention) = E(Y|X), and suppose that X is not only informative on Y but
is also a “cause” of Y (this is somewhat implicit in the assumption E(Y|X_intervention) =
E(Y|X)).
Recalling properties 2 and 4 of the regression function: the errors of forecasts have
expected value 0 and are uncorrelated with the forecasts, if the forecasts are made
using the regression function; in particular, property 2 implies that forecasts done with
regressions are “unbiased”.
Now: we are here considering E(Y|X) but, obviously, there may be many other
observable or unobservable variables, say Z, useful to forecast Y. In general, for a given
value of X, E(Y|X,Z) depends on Z and so is NOT equal to E(Y|X). Now suppose
that your choice of action, X_intervention, depends on some of these Z. In Economics,
for instance, it is infrequent that you choose policy by random assignment:
you intervene because there is some problem (values of some Z you do not like); in
medicine, you do not treat a patient randomly, you intervene because of symptoms.
If this happens, however, it may be the case that, while E(Y − E(Y|X)) = 0
(property 2) and E((Y − E(Y|X)) E(Y|X)′) = 0 (property 4), it is not true that E(Y −
E(Y|X_intervention)) = 0 and E((Y − E(Y|X_intervention)) E(Y|X_intervention)′) = 0, and this
is due to the concomitant “action” of Z.
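A small simulation may make this concrete. The mechanism and all numbers below are hypothetical, not from the text: a variable Z drives both the “free” value of X and the outcome Y; when the intervention value of X is chosen as a function of Z, forecasts built on the untampered system become biased, even though the causal link between X and Y is unchanged.

```python
import random
random.seed(0)

# Toy mechanism (hypothetical). Z drives both the "free" X and the outcome Y:
#   untampered system:  X = Z + u,          Y = X + Z + v
#   intervention:       X is SET to 1 - Z,  Y = X + Z + v  (same causal link)
n = 200_000
Z = [random.gauss(0, 1) for _ in range(n)]
X = [z + random.gauss(0, 1) for z in Z]
Y = [x + z + random.gauss(0, 1) for x, z in zip(X, Z)]

# Fit the forecasting regression E(Y|X) = a + b*X on untampered data.
# Population values here are a = 0, b = 1.5 (the "hidden" Z inflates b).
mx, my = sum(X) / n, sum(Y) / n
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx

# Intervene: X is chosen according to Z ("act more" when Z is low);
# Y is generated by the SAME causal mechanism as before.
Xi = [1 - z for z in Z]
Yi = [x + z + random.gauss(0, 1) for x, z in zip(Xi, Z)]

err = [y - (a + b * x) for y, x in zip(Yi, Xi)]
print(sum(err) / n)  # ≈ -0.5: forecast errors no longer have mean 0 (property 2 fails)
```

In the untampered world the regression forecasts are unbiased by construction; under the Z-dependent intervention the same function systematically over-forecasts, exactly the failure described in the text.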
In other words: even if we are in the best possible setting (there exists a causal
relationship, not just an informative relationship, and the action does not change it) we would
still be unable to quantify the policy outcome by the sole knowledge of E(Y|X).
This implies a word of warning about the following, otherwise reasonable, suggestion:
instead of using as data all available observations on interest rates and growth,
just limit your sample to those data points where rates were in fact determined by
policy actions. This would in fact be reasonable only if we did think that such policy acts
were not themselves at least partially induced by variables, observed or not, which by
themselves have an influence on growth. If this is not the case, what we do observe
would not be connected with the “effect” of policy only, but also with the effect of such
(maybe unobservable) variables. The result could then be of little help, as we cannot
be sure, if we do not model the relationship, about what, for the policy action under
consideration, would be the value of such variables.47
Consider the following example in a different field.
In medical experiments the standard procedure to test a new treatment is to select
a population of subjects with a given medical problem the treatment should help with,
say: high blood pressure. The population is randomly divided in two parts: to one
of these the treatment is assigned, while to the other a placebo is given. Neither the
patient nor the doctor knows whether placebo or treatment was actually given. Why?
Because the knowledge of being treated or not could alter the behaviour of the patient
in such a way as to “confound” the effect of the treatment.
For instance: a patient, knowing that all he got was a placebo, could be induced to
stay on a diet or to follow other therapies. At the opposite extreme, a treated patient, knowing
of having been treated, could decide to spend his time eating hamburgers and fries.
It is clear that, in this case, the result of the experiment would not allow for
a measurement of the effect of the treatment but, maybe, of the effect of the full
procedure of being in an experiment, being given a treatment and knowing about it.
Would this be useful for assessing whether or not to use the treatment as an approved cure for high blood pressure?
Now, suppose the experiment is performed in such a way that the patient does not
know whether the treatment or the placebo was used. In this way you could measure
the specific effect of the treatment.
Suppose, now, you want to translate this result into the effect of the treatment
in real world conditions. In real world conditions, doctors do NOT randomly treat
patients and patients know about the treatment. Patients are treated because they
feel ill, go to the doctor, and the doctor gives treatment. Is the “overall effect” of the
treatment still going to be comparable with the experimental result?
As you can see, lots of interesting problems and questions exist. In fact, in the last
century lots of research has been done on these topics and we know today many results
and procedures useful, case by case, to try and deal with them.
The analysis of these problems has been at the core of the “statistical experiments”
literature in medicine, biology and similar sciences since the beginning of the 20th century,
and of the “Econometrics” movement in the Economics field since the nineteen twenties.
Solutions exist but, in general, these require either observations coming from some
controlled version of the phenomenon, or very strong, untestable, assumptions on the
data, or a mix of both.
In fields like agriculture, medicine, biology and similar, researchers try and study
A(Z) by actually intervening on the variables in some controlled way via statistical
47. Some more detail on this point can be found in Appendix 21.
experiments. The basic hypothesis here is that the distribution of Z under experimental
conditions, say E(Z), be, at least for the relevant regressions, similar to the “in vivo
intervention” distribution A(Z). In these fields it is often the case that P(Z) has very
little meaning, as we are interested in measuring “the effect” of variables which do not
even exist “in nature” (as most medical treatments). Still, when intervention is done “in
vivo”, that is, not in an experimental setting (and this is the final purpose of research),
the problem is to relate the experimental result (governed by E(Z)) with the actual in
vivo result (governed by A(Z)).
In fields like Demography, Economics or Astrophysics, statistical experiments are
either impossible (we cannot create a new universe or cancel a star; we cannot change the
age of people, and while we can kill lots of them we would, hopefully, not consider this an
acceptable procedure) or practically irrelevant (because we cannot assume E(Z) to be
similar to A(Z)). In particular, in the case mentioned above, interest rates and growth,
we could, at least in principle, try and mock an experimental setting by, say, randomly
assigning interest rates to different countries or different production units. However,
apart from the cost of this, it is difficult to expect that rational units would react to
this “strange mess” in any sense similar to what they would do under actual changes
of monetary policy. Moreover, since monetary policy induced interest rate changes are
done for specific purposes in specific economic conditions, it is quite unreasonable to
expect that the “effect” of randomly changed interest rates be in any way similar to
that of changes of interest rates with specific purposes and under specific conditions.
In general, in Economics, we shall most frequently use data governed by P(Z) (observational
data) together with hypotheses on how different A(Z), or at least the relevant
regressions under intervention, could be. (This is called “structural modeling”).
In this course we shall not consider this point which is, obviously, as we did try and
show, of the utmost importance. However, in discussing the examples at the end of
this section, we shall have some opportunity of delving a little further into this point.
In this section we tried to clarify why, while a “forecast” reading of a regression is
always reasonable, a view based on “forecasting the effect of an action” or, simply, a
“causal” reading is by no means reasonable in general, or even possible, without recourse
to a much more complex analysis and a richer set of a priori assumptions.
For this reason, here we are only concerned with how to read a (linear) regression
in a forecasting setting. This shall be in any case useful and, in particular, shall be
useful also when it may be the case that the regression has a causal interpretation.
We shall not deal further with how and under which conditions such an interpretation
can be upheld.
9.12.5 Reading a probability model: a two level procedure
Let us begin with a general principle.
A good reading of a linear model, as of any statistical model, should begin by
splitting the procedure in two parts.
First: understand the meaning of a regression coefficient independently of statistical
inference. That is: suppose all parameters in the model are known and identical
to the estimated values, and learn how to read these.
Here what you need to understand is the meaning of the regression model itself,
without the added burden of parameter estimation.
Second: evaluate what in fact you really know, since parameters are only estimates.
That is: introduce a measure of sampling variability.
The difficult part is the first one. The second part is easy (if we understand the
meaning of sampling variability and, more generally, if we understand statistical inference).
The two aspects should NEVER be mixed, and the subsidiary role of the second
w.r.t. the first, at least from the point of view of interpreting results, must be understood.
This is not something new or specific to the linear model setting. Whenever we
use a statistical model we must first understand the meaning of the model, without
bothering about problems of parameter estimation; once we are done with this, we can try
and see how the fact that some parameter in the model is unknown and must be (and
hopefully can be) estimated influences our perception of the model result.
A simple example: r1 and r2 are daily returns of two stocks. We assume them to
be jointly Gaussian with expected values µ1 and µ2, standard deviations σ1 and σ2 and
correlation ρ. We suppose we have observations on a (bivariate) time series of these
returns (n data points) and that observations in this time series are i.i.d. Due to this,
we can estimate the 5 unknown parameters with the “usual estimates” and get the
values .001 and .002 for the expected values, .01 and .01 for the standard deviations
and .2 for the correlation.
It is clear that these are NOT the true values of the parameters, just estimates;
they can change in other samples and we can quantify these potential changes by measuring
their sampling variability. This notwithstanding, let us follow the suggestion above
and see what the implications of the model are when these values are seen as the true
values of the model parameters.
There is no such thing as a “general” model implication: when we speak of “model
implication” we must have in mind a specific practical problem. Let us say that our
problem is to build a simple long/short position by going long one stock and short the
other for the same amount. We must choose which stock to go long and which to short.
Mind: this “one stock short against one stock long” idea is NOT sound in general;
we should also find the right “hedge ratio” for the long/short position, which could be
different from one to one. Moreover, our understanding of diversification effects tells
us that it would be preferable to work with two portfolios, not with just two stocks.
This is not a lecture on “pair trading” (the practitioner’s name of such a position),
just an example.
If we simply look at the numbers, we see that we should go long the second stock
and short the first: same standard deviation but the second has double the expected
return.
Fine, but what can we expect from our investment?
In order to answer this question let us study the random variable r2 − r1 which,
albeit indirectly, describes the economic result of our investment.48
Using what we know about Gaussian random variables, we see that r2 − r1 is distributed
according to a Gaussian with expected value µ2 − µ1 and variance σ1² + σ2² −
2ρσ1σ2.
Taking the above estimates as true parameter values we have an expected value of
.001 and a standard deviation of .0126.
With these data it is easy to compute the probability of a positive return, over one
day, of our long/short position. This probability is .5315.
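The numbers of this example can be reproduced in a few lines (a sketch; the parameters are the estimates above, taken as true values, as step one of the two-level reading prescribes):

```python
from math import erf, sqrt

# Estimated parameters of the example, treated here as TRUE values.
mu1, mu2 = 0.001, 0.002
s1, s2, rho = 0.01, 0.01, 0.2

mu = mu2 - mu1                                # E(r2 - r1) = 0.001
sd = sqrt(s1**2 + s2**2 - 2 * rho * s1 * s2)  # sd(r2 - r1) ≈ 0.0126

# P(r2 - r1 > 0) for a Gaussian: Phi(mu/sd), with Phi built from erf.
p_positive = 0.5 * (1 + erf((mu / sd) / sqrt(2)))
print(round(sd, 4), round(p_positive, 4))  # 0.0126 0.5315
```

Note how small the standardized mean mu/sd is (about 0.08): this is why doubling the expected return moves the probability of a positive day only slightly above one half.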
This is better than .5 (flipping a fair coin) but, still, it does not seem so exciting,
even if the expected return of r2 is double the expected return of r1 .
Obviously, the decision about what to do is up to the trader, however think about
the different sound of two possible descriptions of the trade.
“The expected return of the long position is double the expected return of the short”.
“The probability of a positive result is .5315”.
As an exercise, consider the same position over different time intervals and the same
position with different values of the parameters.
Now the inference part. The numbers we used above are not true parameter values
but estimates. As such we must quantify their sampling variability.
It is easy to compute Var(µ̂2 − µ̂1) = (σ1² + σ2² − 2ρσ1σ2)/n, where n is the number
of observations (days) in the sample.
The variances and the correlation are unknown, in general, but for the sake of
simplicity let us suppose that they are instead known and equal to the values above.
This gives us a 95% confidence interval for µ2 − µ1 equal to .001 ± 1.96 · .0126/√n.
Suppose you have 1000 observations (roughly 4 years of data). In this case the extremes
of the confidence interval are [.000216; .001784] and, using standard terminology, we say
that the difference between the two expected returns is “significant” at the 5% level
(i.e. the 95% confidence interval does not contain 0).
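The interval can be checked with a short computation (again treating the variances and the correlation as known, as in the text):

```python
from math import sqrt

# Known (by assumption) variance parameters and the estimated mean difference.
s1, s2, rho, n = 0.01, 0.01, 0.2, 1000
diff_hat = 0.001  # estimated mu2 - mu1

# Standard error of the estimated mean difference and the 95% interval.
se = sqrt((s1**2 + s2**2 - 2 * rho * s1 * s2) / n)
low, high = diff_hat - 1.96 * se, diff_hat + 1.96 * se
print(round(low, 6), round(high, 6))  # 0.000216 0.001784
# 0 lies outside the interval: "significant" at the 5% level. This changes
# nothing about the probability of a positive result computed before.
```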
Does this negate the above result? The answer is: no, both if you consider the
position interesting and if you consider the position too risky.
The “significant” result here only means that, even if the difference is small, in the
sense explained above, with enough observations (n big enough) we can distinguish it
from 0 even if we take into account sampling variability.
48. These are log returns, so we should study e^r2 − e^r1, which is the economic result, per unit,
of our investment in one time period (day). Mind: being a long/short position of initial value 0, we
cannot speak of the “return” of the position. We study r2 − r1 just because it is simpler and this is only
an example.
This adds nothing to the properties of the position discussed above, which were
analyzed supposing that the estimates actually correspond to true values.
A more complete and interesting analysis of this result, which compounds the two
steps, could be performed, but it is not within the scope of these handouts.
What should be avoided, if ever considered, is the following reasoning: since the
difference between the estimates is “statistically significant”, this implies that we should,
without any other consideration, suggest taking the long/short position.
And think how much advertising you could get by simply adding, as before,
“and the expected return of the long position is double that of the short”.
All this would not change the fact that “the probability of a positive result is .5315”
(again, this computed supposing the estimates to be the true parameter values).
In conclusion: inferential procedures, the like of confidence intervals or significance
tests, cannot render “relevant” probabilistic results that are not considered so even
supposing that parameters are known.
Splitting the analysis in two steps (first assume estimates are true parameter values,
second discuss the precision of the estimates) can help you avoid such mistakes.
As an exercise, you should construct a case where, due to the small sample size, a result
which could be considered practically relevant if parameters were known is put under
scrutiny by the size of sampling variability.
Statistical inference is a tool for estimating parameters in a probability model and
assessing the amount of sampling variability. Statistics is NOT a tool for evaluating the
meaning or the importance of the results we get applying the model: such a meaning
always depends on the understanding of the model and of the phenomenon the model
describes.
What Statistics can do is just to offer measures of how much we can say about the
values of the parameters in the model on the basis of our sample.
9.12.6
Understanding a linear regression model as a forecasting tool. The
central role of R2
A linear model of the kind Y = Xβ + ε does not always describe a (linear) regression.
It does describe a regression if we assume, in some way, that E(Y |X) = Xβ.
So, for instance, if we do not suppose E(ε|X) = 0 the model is still a linear model but we are NOT interpreting the model as a regression.
It could, then, be very interesting to analyze the nature of a linear model when it does NOT model a regression, but we shall concern ourselves only with the case where the linear model is the model for a regression.
A regression is, first and above all, a conditional expectation of one random variable,
given other random variables.
In our case we consider E(Y |X) or, better (row by row) E(yi |xi ) and in particular
we consider the linear case where we suppose E(yi |xi ) = xi β. (Here yi and xi are the
i − th rows of Y and X).
A regression is an optimal forecast of a variable given other variables. We now
know that “optimal” here means “minimizing mean square error”.
Let us recall the result: a regression is a (vector) function of xi, E(yi |xi ) = φ(xi ), such that it minimizes the “mean square error” E((yi − η(xi ))2 ) over all the functions η(xi ).
As we did stress, this is very general and does not require the regression function
to be linear.
Given xi (be it a single variable or a vector of variables) there is no better way, in
the mean square error sense, to “forecast” yi than using the regression function.
Since the purpose of a regression is to minimize the mean square error, it should
be of interest to know how much such mean square error has been minimized in each
specific case.
We are in the setting of linear regressions; in the empirical version of a linear regression (using data and not a theoretical distribution), minimizing the MSE becomes minimizing the sum of squared residuals. This is equivalent (when the intercept is included) to maximizing the R2. This is the meaning of the variance decomposition result from which we derived the R2.49
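As a numerical illustration (the data below are simulated and all names are ours, not part of the text), the following sketch checks that the OLS coefficients minimize the sum of squared residuals and computes the R2 from the variance decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # simulated data

# Design matrix with intercept; OLS minimizes the sum of squared residuals.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta_hat
ssr = resid @ resid
sst = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ssr / sst  # variance decomposition: R^2 = 1 - SSR/SST

# Any perturbation of the coefficients can only increase the SSR
# (equivalently, with the intercept included, decrease the R^2).
for _ in range(100):
    b = beta_hat + rng.normal(scale=0.1, size=2)
    e = y - X @ b
    assert e @ e >= ssr
```

The loop makes the “minimization” concrete: no nearby coefficient vector does better than the OLS fit in the sum-of-squares sense.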
Why is this discussion of optimal forecasts relevant for understanding the results
of a linear model?
Since a regression is a way to make forecast by minimizing a measure of error
the first, and always valid, “reading” of a regression must be first of all based on
summarizing ”how good” this minimization was.
In the context of linear regression this implies a first, simple, question: “how big” is R2?
The second question usually is: “with this R2, is the regression relevant?” The term “relevant” creates many problems as, clearly, the answer shall depend on the
specific context. For this reason, while researchers do propose, e.g. reference values for
R2 under which a regression should be considered irrelevant, we suggest here a more
cautious path which tries to merge an “absolute” evaluation of the R2 with a more
specific connection to the specific practical context of the analysis.
We shall discuss this point further on with examples.
However we cannot and do not stop here.
It is almost always the case that we look for some further “decomposition” of the
“explained variance” in terms of each single “explanatory variable”.
The third step in the analysis requires finding a variable-by-variable measure of relevance.
49 We should discuss the difference between the theoretical and the empirical variance, but this is not very relevant here.
This is a fully legitimate question, if correctly understood. It is also dangerous because, if not correctly intended, it is borderline to a “causal”, and possibly wrong, reading of a regression.
We shall be able to correctly understand a regression if we are able to precisely set the bounds under which this question has a meaning, and if we are able to answer this question from within these bounds.
This measure, too, shall be based on the R2 .
The Reader should notice that the value of R2 depends on the full joint distribution of Y and X. If this changes, maybe because the conditions change between the
observation of different samples, or because the observation of the estimation sample
is made under different conditions than that of the sample whose values we want to
forecast, then the conclusions of the analysis may change (in principle, the regression
itself may change). This is an important point we do not have time to fully consider
in these handouts. Something more on this topic shall be discussed in section 9.12.12.
In what follows we shall be careful to apply a lot of “statistical restraint”.
In fact there exist many “simple” answers that would be so beautiful and direct, if they were true. The problem is that wishful thinking is not, as a rule, good Statistics.
The path to a correct answer begins with the understanding of the following fundamental results.
9.12.7
The partial regression theorem. Partitioning R2
There exist two versions of the partial regression theorem. They are very similar because the proof is based on the strong mathematical similarity between two completely different objects: frequencies and probabilities.
We first prove the “frequency based” version, that is: the partial regression theorem
valid for OLS estimates.
The second version has to do with “theoretical” regression functions, that is, with probability, and can be seen as a direct application of the law of iterated expectations to a linear regression.
While quite obvious in terms of proof, the partial regression theorem tells us something which, maybe, is a priori unexpected: any given coefficient in a linear regression
is NOT a derivative with respect to the corresponding variable, in the common sense
of the term.
In fact, what a coefficient in a linear regression really is, is something of much more
interest and to understand this is fundamental in order to correctly interpret the result
of a regression.
Theorem 9.4. The estimate of any given linear regression coefficient βj in the model
E(Y |X) = Xβ can be computed in two different ways yielding exactly the same result:
1) by regressing Y on all the columns of X, 2) by first regressing the j − th column of
X on all the other columns of X, computing the residuals of this regression and then
by regressing Y on these residuals.
Proof. Write the model as Y = Xj βj + X−j β−j + ε, where you isolate the j−th column of X in Xj and put the rest in X−j. To make things simple suppose the intercept is in X−j.
You estimate it with OLS and get: Y = Xj β̂j + X−j β̂−j + ε̂. Now write the auxiliary regression: Xj = X−j γj + uj and estimate it with OLS to get Xj = X−j γ̂j + ûj. Substitute this in the original OLS estimated model: Y = (X−j γ̂j + ûj )β̂j + X−j β̂−j + ε̂ = ûj β̂j + X−j (γ̂j β̂j + β̂−j ) + ε̂. By orthogonality of ûj with both X−j and ε̂ we get

$$\sum_i Y_i \hat u_{ij} = \sum_i \big(\hat u_{ij}\hat\beta_j + X_{i,-j}(\hat\gamma_j\hat\beta_j + \hat\beta_{-j}) + \hat\varepsilon_i\big)\hat u_{ij} = \sum_i \hat u_{ij}^2\,\hat\beta_j \quad\text{so that}\quad \hat\beta_j = \frac{\sum_i Y_i \hat u_{ij}}{\sum_i \hat u_{ij}^2}$$

which, since the mean of ûj is equal to 0 (X−j contains the intercept), is identical to the OLS estimate in a regression of Y on ûj alone.
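A small numerical check of the theorem may help. In the following sketch (the data are simulated and the use of numpy's least squares routine is our choice, not part of the text) β̂j is computed in both ways and the two results coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X[:, 2] += 0.7 * X[:, 1]  # make the columns correlated, so the theorem is non-trivial
beta_true = np.array([1.0, 0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

# Way 1: regress Y on all the columns of X.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

j = 2  # isolate the third column
Xj, Xmj = X[:, j], np.delete(X, j, axis=1)

# Way 2, step 1: auxiliary regression of X_j on the other columns, keep residuals.
gamma_hat, *_ = np.linalg.lstsq(Xmj, Xj, rcond=None)
u_hat = Xj - Xmj @ gamma_hat

# Way 2, step 2: regress Y on the residuals alone (no intercept needed:
# u_hat has mean zero because the intercept is in X_{-j}).
beta_j_partial = (y @ u_hat) / (u_hat @ u_hat)

assert np.isclose(beta_hat[j], beta_j_partial)
```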
A similar result is directly valid for the regression function (if we suppose all regressions to be linear; otherwise a similar but more general result holds). The result is valid without considering estimates, as a direct property of theoretical linear regression functions. In this case the statement of the theorem becomes:
Theorem 9.5. Any given linear regression coefficient βj in the linear regression E(Y |X) = Xβ is identical to the coefficient of the regression of Y on Xj − E(Xj |X−j ), if we suppose E(Xj |X−j ) to be linear.
Proof. The proof mimics the proof based on estimates of βj and goes as follows:

$$E(Y|X_{-j}) = E_{X_j|X_{-j}}\big(E(Y|X)\big) = X_{-j}\beta_{-j} + E(X_j|X_{-j})\beta_j = X_{-j}\beta_{-j} + X_{-j}\gamma_{X_j|X_{-j}}\beta_j = X_{-j}\big(\beta_{-j} + \gamma_{X_j|X_{-j}}\beta_j\big)$$

and

$$E(Y|X) = X_{-j}\beta_{-j} + X_j\beta_j = X_{-j}\beta_{-j} + \big(X_j - X_{-j}\gamma_{X_j|X_{-j}}\big)\beta_j + X_{-j}\gamma_{X_j|X_{-j}}\beta_j$$

so

$$E_{X\,|\,X_j - X_{-j}\gamma_{X_j|X_{-j}}}\big(E(Y|X)\big) = E\big(Y\,\big|\,X_j - X_{-j}\gamma_{X_j|X_{-j}}\big) = \big(X_j - X_{-j}\gamma_{X_j|X_{-j}}\big)\beta_j.$$
You should notice that, in this proof, regressions are required to be linear, while the proof concerning estimates only requires that the estimates come from the use of OLS in linear models.
Notice, moreover, that the proof with estimates is based on the algebraic properties
of OLS estimates: weak or strong OLS hypotheses are not required. In practice, the
only property used in the proof is that of orthogonality (with intercept included) which
directly comes from OLS.
This result, in both versions, is relevant because it immediately implies that each βj is not connected with some “relationship” between the j−th column of X and Y, but only with the relationship between Y and the part of the j−th column of X which is (linearly) regressively independent of the other columns of X.
In other words: the linear regression model does not measure, in any sense, the “effect” of a given column of X. Whatever the definition of such “effect”, it has only to do with the part of this column's variance which is uncorrelated with the other columns.
The meaning of a regression coefficient for the same variable depends on which
other variables are in the regression and both the coefficient and the meaning change
if we change the other variables in the regression.
While a regression may have a causal interpretation, this is by no means necessary or even common. It is then important, when we speak of “effect”, to avoid the impression of speaking in causal terms.
We shall then define and measure the “effect of a variable” in a regression for what
it is and for what it is implied to be in the partial regression model.
For us, this is just the marginal “effect” or “contribution” of a column of X in
reducing the mean square error or, equivalently, improving the forecast performance,
when the other columns are accounted for in the sense of the above theorem.
This “effect” is to be better understood in “informational” terms as the ability to improve the quality of a fit and, if the inferential extension of fit to forecast is justified, in terms of quality of forecast.
When the intercept is in the model, this is measured by the increase of R2 you get
if you add the Xj column to the model, or, equivalently, by how much the R2 decreases
if you drop such column from the model.
This quantitative measure, specific to Xj as used in a regression with a GIVEN set of other variables X−j, is called the “semi partial R2”.
9.12.8
Semi Partial R2
We define this as the “marginal” change of R2 due to each column in X after accounting
for the other columns.
This seems to imply that its computation is, if not difficult, quite long: run the
regression of Y on the full X, compute the corresponding R2 and then drop in turn
each single column in X, one at a time, and measure the corresponding change in R2 .
This is not only long to do, it could be impossible if we are evaluating regression results as we read them in a paper or a report, since we would need to work on the original data to perform the computations.
It is quite frequent, at least in the social sciences milieu, that even the overall R2 is not reported.
We have a way out of this which can almost always (in simple OLS setting) be
implemented.
A “folk” and simple result of OLS algebra (we give it here without proof, but see
further on for a proof, not required for the exam, in a footnote) allows us to determine
the marginal contribution of each column in X to the R2 even if we only know the
T −ratios for the single parameters and the size n of the data set.
Lemma 9.6. Suppose we are using OLS and we drop the column Xj from the matrix X. The decrease of the overall R2 corresponding to the dropped column (call this R2 − R2−j ) is equal to

$$R^2 - R^2_{-j} = t_j^2(1 - R^2)/(n - k)$$

where t2j is the square of the T−ratio for Xj in the full regression, n is the number of observations and k is the number of columns in the full regressors matrix. This is called the “semi partial R2” for Xj and is nothing but the R2 of the regression of Y on the residuals of the partial regression of Xj on X−j.
Here the T −ratio is assumed to be computed with the standard formula we gave
in the section about OLS.
An interesting point in this result is that it allows us to “recycle” a quantity which
we considered as just a measure of statistical reliability, as a useful way for reading
the R2 . This is just an algebraic result, that is: it is valid in any sample and does
not require either weak or strong OLS hypotheses to be valid. It is just a numerical
identity.
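The identity can be checked numerically. In the following sketch (simulated data; the variable names are ours) the semi partial R2 is computed both by running the two regressions and by the t²(1−R²)/(n−k) formula, with the T−ratio built from the standard OLS variance estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.3, -0.8, 0.1]) + rng.normal(size=n)

def r_squared(M, v):
    b, *_ = np.linalg.lstsq(M, v, rcond=None)
    e = v - M @ b
    return 1.0 - (e @ e) / (((v - v.mean()) ** 2).sum())

j = 3
r2_full = r_squared(X, y)
r2_drop = r_squared(np.delete(X, j, axis=1), y)
semi_partial = r2_full - r2_drop  # direct computation: two regressions

# T-ratio for beta_j with the standard OLS formula.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
sigma2_hat = (e @ e) / (n - k)
var_bj = sigma2_hat * np.linalg.inv(X.T @ X)[j, j]
t_j = b[j] / np.sqrt(var_bj)

# The lemma: the drop in R^2 equals t_j^2 (1 - R^2) / (n - k), exactly.
assert np.isclose(semi_partial, t_j**2 * (1 - r2_full) / (n - k))
```

Note that, being an algebraic identity, the equality holds in every sample, not just on average.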
Beyond allowing us to compute the semi partial R2, this result has other interesting
implications. As we stressed before, with a big sample size it is very difficult for
a T −ratio not to be “statistically significant” as even a very small number can be
distinguished from zero if the sampling variance is small enough (the denominator of
the T -ratio is divided by the square root of n − k). However, when n − k is big, it
is quite possible that the estimate could be “significant” while, at the same time, the
relevance of the variable could be totally negligible, as the added contribution of the
variable to the explanation of Y variance could be negligible.
Suppose you have, say, n = 10000 (not uncommon a size for a sample in social
sciences and in Finance). Suppose you have 10 columns in the X matrix and the
T−ratio for a given explanatory variable is of the order of 4, so that the P−value shall be smaller than .0001: “very” statistically significant! True, but the above lemma tells us that the contribution of this variable to the overall R2 is at most (that is: even for an overall R2 very near to 0) approximately 16/10000, that is: less than two thousandths. Hardly relevant from any practical point of view! If I drop the variable the overall explanatory power of the model drops by way less than 1%.
Another way to see the same point is this: how big should be the T −ratio, under the
previous hypotheses, so that the marginal contribution of the regressor to the overall
R2 is, say, 10%? From the above formula we have that the T −ratio should be of the
order of 32 (the square root of 1000 is about 31.62).
This makes even clearer the fact that, in general, “statistical significance” (that is: the estimate is precise enough that we are able to distinguish it from 0) and “relevance” (here measured by the contribution to the forecasting power of the model as measured by R2) are very different concepts.
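The arithmetic of the two examples above can be reproduced in a few lines:

```python
import math

# Upper bound on the semi partial R^2 implied by a given T-ratio:
# semi partial R^2 = t^2 (1 - R^2) / (n - k) <= t^2 / (n - k).
n, k, t = 10_000, 10, 4.0
bound = t**2 / (n - k)  # about 0.0016: less than two thousandths
assert bound < 0.002

# Conversely, the T-ratio needed for a 10% marginal contribution,
# taking the most favourable case of an overall R^2 near 0:
t_needed = math.sqrt(0.10 * (n - k))
assert 31 < t_needed < 32  # about 31.6
```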
It is now important to understand how semi partial R2 works across different
columns of X.
If we compute this quantity for each column of X, we measure the marginal contribution of each of these columns in “explaining” (a frequently used but not so correct term) the variance of Y. As stated above, here “marginal” means: how much the introduction
of the variable improves the R2 when the other variables are already all in the model.
This means that, each time we compute this quantity for a different column Xj , the
“other variables” left in X−j are different. For this reason, while we can split the overall
R2 in a part due to the introduction of a new variable and a part already “explained”
by the other variables, we cannot add in any meaningful way the semi partial R2 of
different variables except in the case when all the columns of X are uncorrelated (not
very interesting in our field).
Summary up to this point: a regression is a forecast which minimizes the mean square error (if it is linear with an intercept, it minimizes the variance of the error). This is the purpose of a regression and it should be evaluated insofar as the quality of the forecast is sufficient to our purpose (the specific purpose is going to enter the evaluation).
While in a forecasting setting there is no big role for speaking of “the effect” of this
or that column of X, it is possible (partial regression theorem) to define the marginal
contribution of each column of X to the overall R2 . The quantitative measure of this
contribution is given by the semi partial R2 .
9.12.9
Further discussion about the “effect” of a variable in a forecasting
setting
When we speak of “contributions to the R2 ” we are not speaking of the contribution
of single specific values of a column of X but, obviously, we are considering the full
variance of the column.
Whoever reads a paper containing results of a linear model is, almost unavoidably, treated to sentences of the kind “a change of 1” or “a change of one standard deviation” in the variable xj “is going to have as effect, on average, a change of y of...”, and here, typically, you find βj times the change in xj.
As we discussed above, this is, implicitly, a “causal” or “intervention” interpretation.
Strictly speaking, in the simple “regression as forecast” context, such a statement has
no meaning.
In a regression used as a forecasting tool, we do not attribute changes of values in
the column Y to changes in this or that value of the column Xj . Our interpretation
is, first, in terms of how much of the full variance of Y “can be forecasted” using
our knowledge of all (columns and rows) of X and, second, in terms of the marginal
contribution of each column Xj .
Again, and sorry for the number of times this is repeated, but it could be useful: this is done without any causal interpretation, whether or not such an interpretation is available.
So, is it possible to give any meaning, outside a causal interpretation, to statements the like of: “a change of 1” or “a change of one standard deviation” in the variable xj “is going to have as effect, on average, a change of y of...”?
Obviously, it may be that such a statement, trivially, is only intended to mean that, considering E(Y |X) = Xβ as a linear function of X, the elements of β can be seen as
derivatives with respect to each xj .
This is trivially true and trivially useless, by itself. A derivative times the change of a variable xj measures the change of E(Y |X) only if this change in xj happens without altering any other value in X or the functional form of E(Y |X) itself, which is exactly what we cannot guarantee without further hypotheses.
In forecasting terms this statement boils down to the obvious sentence: “if I have
two different rows of X where the only difference between the rows is in the value of
Xj then the difference between the forecasts shall be βj times the difference in the two
values of Xj ”. This is arithmetically true but not very useful.
As already stated, a real solution of this problem has to do with a “causal” or “intervention” interpretation of a regression, that is, with the study of conditions under which we can actually interpret a βj as a measure of the “effect” of a change in one variable “keeping the rest constant” that we can actually implement.
What we can do is to use the semi partial R2 in order to measure the specific
contribution of all the rows of Xj to the overall R2 that is: the usefulness of each
column of X in the forecast of Y .
This is useful when our purpose is limited to forecasting and, obviously, shall also be useful when we are in the condition of performing an “intervention” analysis (which we do not consider here).
From this, one must not infer that the “value of βˆj ” is by itself not relevant in a
forecasting setting.
Actually, it is possible to show that such value, even in a forecasting setting, has
something to do with a ratio of standard deviations in a way that echoes (but in a
correct way) the (wrong) interpretation of βˆj of the kind: “how much Y changes if I
change Xj of one standard deviation”.
Let us begin with a useful result connected to the semi partial R2 .
Consider the regressions of Y on X−j and of Y on both X−j and Xj:

$$Y = X_{-j}\alpha + \epsilon_0$$
$$Y = X_{-j}\beta_{-j} + X_j\beta_j + \epsilon.$$

Proceeding in a similar way as in the partial regression theorem, regress Xj on X−j. Substituting Xj = X−j γ̂ + Û in the second model you get

$$Y = X_{-j}(\beta_{-j} + \hat\gamma\beta_j) + \hat U\beta_j + \epsilon.$$

Since Û and X−j are uncorrelated, the estimate of β−j + γ̂βj coincides with α̂. Hence

$$Y = X_{-j}\hat\alpha + \hat U\hat\beta_j + \hat\epsilon$$

and therefore

$$Var(Y) = Var(X_{-j}\hat\alpha) + Var(\hat U)\hat\beta_j^2 + Var(\hat\epsilon).$$
Notice, moreover, that V ar(Û ) is, by definition, equal to V ar(Xj |X−j ) (it is, in fact,
the variance of the residual in the regression of Xj on X−j ). We then have that the
overall R2 of the regression of Y on X can be written as:
R2 = (V ar(X−j α̂) + V ar(Û )β̂j2 )/V ar(Y )
As a consequence, the increment in R2 , when going from the first model to the second
one, that is: the semi partial R2 , is
$$R^2 - R^2_{-j} = t_j^2\,\frac{1-R^2}{n-k} = \frac{Var(\hat U)}{Var(Y)}\,\hat\beta_j^2 = \frac{Var(X_j|X_{-j})}{Var(Y)}\,\hat\beta_j^2 = \big(1-R^2_{X_j X_{-j}}\big)\frac{Var(X_j)}{Var(Y)}\,\hat\beta_j^2.$$
We have, then:

$$\sqrt{t_j^2(1-R^2)/(n-k)} = \frac{\sqrt{Var(X_j|X_{-j})}\;|\hat\beta_j|}{\sqrt{Var(Y)}}.$$
That is: the square root of the semi partial R2 for Xj is, in units of the standard deviation of the data on Y, the change in the conditional expectation of Y given by a “reasonable” change in Xj: reasonable when the “other X−j are kept constant”, hence measured with the CONDITIONAL standard deviation of Xj GIVEN X−j.50
50 While Var(Xj |X−j )β̂j2 /Var(Y) is a “direct definition” of the semi partial R2 for Xj, as this is the difference between the R2 with the full X and the R2 with X−j, the identity

$$\frac{Var(X_j|X_{-j})\,\hat\beta_j^2}{Var(Y)} = t_j^2(1-R^2)/(n-k)$$

rests on a lemma we did not prove.
A simple proof of this lemma (what follows is not for the exam!) is as follows and, as it should now be not surprising, is based on the partial regression theorem. Since β̂j can be estimated by regressing Y on the residuals of the regression of Xj on X−j we have

$$\hat\beta_j = \frac{\sum_{i=1}^n y_i\,(x_{ij} - X_{i,-j}\hat\gamma_{X_j X_{-j}})}{\sum_{i=1}^n (x_{ij} - X_{i,-j}\hat\gamma_{X_j X_{-j}})^2} = \frac{\sum_{i=1}^n y_i\,(x_{ij} - X_{i,-j}\hat\gamma_{X_j X_{-j}})}{n\,Var(X_j|X_{-j})}.$$

The sampling variance of this is V(β̂j ) = σε2 nVar(Xj |X−j )/(nVar(Xj |X−j ))2 = σε2 /(nVar(Xj |X−j )). The estimate of this is V̂(β̂j ) = σ̂ε2 /(nVar(Xj |X−j )) = (nVar(ε̂)/(n − k))/(nVar(Xj |X−j )). Now recall that Var(ε̂) = Var(Y)(1 − R2 ), so we have

$$t_j^2(1-R^2)/(n-k) = \frac{\hat\beta_j^2\,(1-R^2)(n-k)\,n\,Var(X_j|X_{-j})}{Var(Y)(1-R^2)(n-k)\,n} = \frac{\hat\beta_j^2\,Var(X_j|X_{-j})}{Var(Y)}$$

and we have the proof.
All this depends on the fact that V(β̂j ) is estimated with the usual OLS formula. This requires the hypothesis of uncorrelated residuals with constant variance. In case residuals are correlated, and in particular correlated within groups of data, a frequently used estimate for the variance of the estimate of βj is the “clustered estimate”. This tends to be bigger than the OLS based one and, by consequence, t−ratios tend to be smaller. A reasonable rule of thumb (see, e.g., Brent R. Moulton (1986) “Random group effects and the precision of regression estimates”, Journal of Econometrics) gives an increase of the estimated variance which should be smaller than n/q, where q is the number of “clusters” in the data (hence, n/q is the average size of groups). This implies that the “clustered” t−ratio should be approximately equal to the standard OLS one divided by √(n/q). If we call tc such a “clustered” t−ratio, the above formula for the semi partial R2 should be amended to t2cj (n/q)(1 − R2 )/(n − k). However, in this case, a direct evaluation of the semi partial R2 is always possible using two OLS regressions. The approximation is useful if, improperly, info on the semi partial R2 is not offered by the Authors.
Moreover, since Var(Xj |X−j ) = (1 − R2Xj X−j )Var(Xj ), we can write:

$$t_j^2(1-R^2)/(n-k) = \big(1-R^2_{X_j X_{-j}}\big)\frac{Var(X_j)}{Var(Y)}\,\hat\beta_j^2$$

or, equivalently

$$\hat\beta_j^2 = \frac{Var(Y)\; t_j^2(1-R^2)/(n-k)}{\big(1-R^2_{X_j X_{-j}}\big)\,Var(X_j)}.$$
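This last identity lets us recover |β̂j | from quantities a paper typically reports (the T−ratio, the overall R2, n and k) together with descriptive statistics. A sketch on simulated data (all names and numbers are ours): we fit once to produce the “published” summary numbers, then reconstruct the coefficient from them alone.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 250, 3
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, -0.7, 1.2]) + rng.normal(size=n)

# Fit once: these play the role of the reported summary statistics.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
r2 = 1 - (e @ e) / (((y - y.mean()) ** 2).sum())
t2 = b[2] ** 2 / ((e @ e) / (n - k) * np.linalg.inv(X.T @ X)[2, 2])

# Tolerance of x2: 1 - R^2 of the regression of x2 on the other columns.
Xm2 = np.delete(X, 2, axis=1)
g, *_ = np.linalg.lstsq(Xm2, X[:, 2], rcond=None)
u = X[:, 2] - Xm2 @ g
tol = (u @ u) / (((X[:, 2] - X[:, 2].mean()) ** 2).sum())

# Recover |beta_2| from summary quantities only, via
# beta_j^2 = Var(Y) * t_j^2 (1-R^2)/(n-k) / (Tol_j * Var(X_j)).
beta2_sq = y.var() * t2 * (1 - r2) / (n - k) / (tol * X[:, 2].var())
assert np.isclose(np.sqrt(beta2_sq), abs(b[2]))
```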
Notice that (1 − R2Xj X−j ) is frequently used as a measure of how correlations in the matrix X affect each Xj and is printed in regression output (usually as an option) under the name “Tol” for “tolerance”. This is because, if it happens that the Tol for any Xj is 0 or near 0, implying Xj linearly dependent, or almost linearly dependent, on X−j, then X′X cannot, or almost cannot, be inverted due to collinearity.
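A sketch of how the tolerance could be computed directly (simulated data, with a nearly collinear column built on purpose; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])

def tolerance(X, j):
    """Tol_j = 1 - R^2 of the regression of X_j on the other columns,
    i.e. the fraction of Var(X_j) left after conditioning on X_{-j}."""
    Xj, Xmj = X[:, j], np.delete(X, j, axis=1)
    g, *_ = np.linalg.lstsq(Xmj, Xj, rcond=None)
    e = Xj - Xmj @ g
    return (e @ e) / (((Xj - Xj.mean()) ** 2).sum())

assert tolerance(X, 1) < 0.05  # x1 almost linearly dependent on the rest
assert tolerance(X, 3) > 0.9   # x3 nearly uncorrelated with the rest
```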
The idea of a first assessment of the “relevance” of a variable in a regression by
computing “how much the conditional expected value of Y changes if we change Xj of
one unit of standard deviation” and then compare this with the empirical standard
deviation of Y , is, as we just said, common even if in general unjustified as it implies
a causal interpretation of the regression.
From the above computations we see that there is a reasonable and correct version of this argument, in the forecasting setting, in terms of semi partial R2.
We can summarize this in several, equivalent, ways:
1. The square root of the semi partial R2 for a given column of Xj is the ratio
between the fraction of the standard deviation of Xj that is not correlated with
the other columns of X and the standard deviation of Y times the absolute value
of βˆj .
2. The absolute value of β̂j is the ratio between the fraction of standard deviation of Y “explained” by Xj alone (the square root of the semi partial R2 times the standard deviation of Y ) and the amount of standard deviation of Xj that is uncorrelated with the other columns of X.
3. βˆj in absolute value is the ratio between a “reasonable” movement of Y (one
sigma) and one “reasonable” movement of Xj conditional to the other columns of
X (one CONDITIONAL sigma) multiplied by the square root of the semi partial
R2 .
4. The square of βˆj is the ratio between what “is to be explained” (the variance of
Y ) and what is “left in Xj to explain Y ” (the conditional variance of Xj ) times
the fraction of the variance of Y “explained” by Xj given the other columns of X
(i.e. the semi partial R2 ).
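Point 1 of the list can be checked numerically on simulated data (all names are ours, and the residual of the auxiliary regression plays the role of “the part of Xj uncorrelated with the other columns”):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 0.5, 1.5]) + rng.normal(size=n)

def r2(M, v):
    c, *_ = np.linalg.lstsq(M, v, rcond=None)
    e = v - M @ c
    return 1 - (e @ e) / (((v - v.mean()) ** 2).sum())

b, *_ = np.linalg.lstsq(X, y, rcond=None)

j = 2
Xmj = np.delete(X, j, axis=1)
g, *_ = np.linalg.lstsq(Xmj, X[:, j], rcond=None)
u = X[:, j] - Xmj @ g  # part of X_j uncorrelated with X_{-j}

semi_partial = r2(X, y) - r2(Xmj, y)

# Point 1: sqrt(semi partial R^2) = (conditional sd of X_j / sd of Y) * |beta_j|
assert np.isclose(np.sqrt(semi_partial), (u.std() / y.std()) * abs(b[j]))
```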
9.12.10
Again: “statistically significant” VS relevant
We have offered some hint of how to interpret a regression when parameters are known.
In all standard applications, β is not known and must be estimated.
This is the “second step” in interpreting results we mentioned above.
What changes? Actually not much, the problem is only to quantify how much we
really know about β, since we can only estimate it, and it is easy to do this by the
study of its sampling variability.
Although this should be simple (and in fact it is simple) it may create some new
interpretation problems we discussed above, when we compared “statistical significance”
with “relevance”.
It is an empirical observation that, while the pitfalls and interpretation errors we are going to summarize here are warned against in most good books of Statistics, this common advice seems to work in some empirical fields while it is almost of no consequence in others.
It is usually the case that the effect of such warnings is bigger when empirical
analysis has real practical purpose and smaller when the main reason for empirical
analysis is more “paper publishing” oriented.
In empirical Economics and Finance the main question has to do with the concept
of “statistical significance”.
In the standard statistical jargon an estimate of a parameter is “statistically significant” if its estimated value, compared with its sampling standard deviation, makes it unlikely that in other samples the estimate may change sign.
In the standard regression setting, the most frequently used statistical index is the T−ratio, and an estimated βj has a “significance” which is usually measured in terms of the P−value of its T−ratio.
Does a small P −value imply that a parameter is “relevant” in any sense, except the
fact that it is estimated well enough so that its value, in other possible samples, should
not change sign?
As already discussed, the answer is “absolutely not”.
We already commented on this when considering the semi partial R2 . There is an
even more striking way to present the point: suppose the parameter is known and is different from zero (so that its P−value is 0: it cannot be more significant than this!); still, the actual relevance of the corresponding regressor could be absolutely negligible if the
semi partial R2 is small. Here, by relevance, we intend the ability of the corresponding
Xj to “explain” an amount of variance of Y which is big w.r.t. the total variance of Y
but similar statements are true if you measure “relevance” in a more complex causal
setting.
“Statistically significant” only means that the statistical quality (precision) of the
estimate is such that the estimate should not change sign if we change the sample. In
iid samples, if n is big, typically all parameters become “statistically significant”.
This is because the sampling standard deviation decreases at rate 1/√n, so that even a practically negligible βj can be estimated with enough precision to allow us to distinguish it from zero.
In no way does this imply βj to be “relevant” in any practical sense. What happens here is that, with n big enough, we can reliably assess that an irrelevant effect is actually irrelevant.
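A small simulation (data and thresholds are ours) makes the point concrete: the true slope is a negligible 0.02, yet with a large n the T−ratio is comfortably “significant” while the R2 stays tiny:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)  # true slope is a negligible 0.02

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
sigma2 = (e @ e) / (n - 2)
t = b[1] / np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

r2 = 1 - (e @ e) / (((y - y.mean()) ** 2).sum())
assert abs(t) > 3   # comfortably "significant"...
assert r2 < 0.01    # ...yet the variable explains under 1% of Var(Y)
```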
It is frequent to see published papers in major journals where linear models with
tens of regressors and tens of thousands of observations result in statistically significant
coefficients with an overall R2 in the range of few percentage points and semi partial R2
of fractions of 1%. Whatever the notion of “relevance” (forecasting, always available, or causal, requiring many hypotheses), it is difficult to conceive of any practical setting where such results could be termed “relevant”, except insofar as they give relevant support to statements of “irrelevance” of the corresponding effects.
This would not be so important if the same papers did not spend most of their length discussing the meanings and the practical relevance of the effects supposedly “found”.
This misunderstanding between “statistical significance” and “relevance” must be
avoided. If models were used for practical purposes (say for forecasting or controlling
variables) the misunderstanding would quickly disappear: an estimate can be as significant as I like but, if the R2 is small, the quality of the forecast shall be awful all
the same.
When models are only used for academic purposes (appear in published papers) the
misunderstanding may continue unscathed, sometimes with hilarious consequences.
Summary: first assess the relevance of the regression and the parameters of interest
in terms of explained variance as if parameters were known and not estimated. Then
look at the statistical stability of the results. An irrelevant parameter is still irrelevant if
it is “significant” while a parameter which could be relevant can be put under discussion
if its sampling variance is too big.
9.12.11
A golfing example
In the following example we try to determine how much of the average per-tournament gain of the most important competitors in the 2004 PGA events can be captured by a linear regression on some ability indexes and other possibly relevant variables.
The dependent variable, AveWng, is the average gain.
The columns of X are:
a constant,
Age=the age of the player
AveDrv=the average drive length in yards
DrvAcc=the percentage of drives to the fairway
GrnReg=the percentage of times the player reaches the green in the “regular” number of strokes
AvePutt=the average number of putts per hole (should be less than 2)
SavePct=the percentage of saved points
Events=the number of events the player competed in
We have some expectations for the possible two-way correlations between these variables and the average winnings but, since the regression estimate measures a kind of joint dependence, it is really possible that such expectations, while reasonable, do not apply to the regression coefficients.
In particular it is reasonable to assume that expected average money is positively correlated with AveDrv, DrvAcc, GrnReg and SavePct, negatively with AvePutt, while we do not have an a priori on Age and Events.
By the way: in this example no direct causal interpretation is reasonable.
Let us start with some descriptive statistics and a simple correlation matrix:
From this correlation matrix we see that at least one of our expectations is apparently not true: the correlation with driving accuracy is negative. However we also see that, and this could be expected, the correlation between AveDrv and DrvAcc is rather strong and negative (longer means riskier).
We’ll see that this has an interesting implication on the overall regression.
Let us now run the regression:
Do not jump to the coefficients! First read the overall R-square. It is .45, that is: 45% of the Y variance is due to its regression on X. It is important to keep this in mind: anything we shall further say about coefficients, partial R squares etc. lies within this percentage. More than 50% of the dependent variable variance is not “ruled” by its regression on X.
Another way to read the same result is to compute a confidence interval for the
(point) forecasts. As seen above this interval is given by
$$x_f'\hat\beta_{OLS} \pm z_{(1-\alpha/2)}\,\sigma\sqrt{1 + x_f'(X'X)^{-1}x_f}$$
It could be shown that, for n − k not too small, X'X with a determinant not too near to 0, and an x_f not “too far” from the observed rows of X, this can be approximated by

$$x_f'\hat\beta_{OLS} \pm z_{(1-\alpha/2)}\,\sigma$$
Under the same hypotheses we can freely put σ̂ in the place of σ and still use the Gaussian in place of the T distribution. With this approximation, the point forecast interval is the same for each x_f and its width is 2z_{(1−α/2)} σ̂_ε. If we allow for the plus/minus two sigma rule this, with our data, becomes 4 times 41432, that is: forecast plus/minus 2 times 41432. If we stick to the Gaussian hypothesis (or believe a central limit theorem can be applied in our case) this interval should contain the true value of y_f with a probability of more than 95%.
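The two interval formulas above can be checked numerically. This is a minimal sketch on simulated data (the sample size, the coefficients and the forecast point x_f are all hypothetical, not the golf data): for a well-conditioned X'X and an x_f close to the observed data, the exact and approximate half-widths almost coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the data (n, k, beta, sigma are all hypothetical)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([5.0, 1.0, -2.0, 0.5])
sigma = 3.0
y = X @ beta + rng.normal(scale=sigma, size=n)

# OLS fit and residual standard deviation
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - k))

# Half-width of the ~95% point forecast interval at a new point x_f
x_f = np.array([1.0, 0.2, -0.1, 0.3])
z = 1.96
q = x_f @ np.linalg.inv(X.T @ X) @ x_f     # the estimation term
exact_half = z * sigma_hat * np.sqrt(1 + q)
approx_half = z * sigma_hat                # drops the (small) estimation term

print(exact_half, approx_half)
```

With n large relative to k, the estimation term q is of order k/n and the two half-widths differ by a fraction of a percent.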
If we go back to the descriptive statistics, we see that the standard deviation of
AveWng is 54990. This means that, without regression, our forecast would be the same
for each observation and equal to the average (46548) and the corresponding forecast
interval would be 46548 plus/minus 2 times 54990. With the regression our forecast is x_f'β̂_OLS, so it varies with the observed x_f, and the variability “captured” by the regression is “subtracted” from the marginal standard deviation, so that the point forecast interval shall be narrower: the point forecast plus/minus 2 times 41432.
You should notice that, with an R2 of about .45, the width of the forecast interval is reduced only by about one quarter (from 54990 to 41432). This is not surprising: the R2 is in terms of variance while the interval is in terms of standard deviations. Variances (explained and unexplained by the regression) sum, standard deviations do not (the square root of a sum is not the sum of the square roots). For this reason the term “subtracted” above is put under quotes.
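The variance-vs-standard-deviation point can be made precise: the residual standard deviation is √(1 − R2) times the marginal one, so the interval width shrinks by the fraction 1 − √(1 − R2). A quick check with the numbers of the text:

```python
import math

# Variances sum; standard deviations do not.
r2 = 0.45
sd_ratio = math.sqrt(1 - r2)      # residual sd as a fraction of the marginal sd
width_reduction = 1 - sd_ratio    # shrinkage of the forecast interval width

print(sd_ratio, width_reduction)  # about 0.742 and 0.258

# With the numbers of the text: marginal sd 54990, residual sd 41432
print(1 - 41432 / 54990)          # about 0.247: roughly one quarter
```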
We may then question the statistical precision of our estimates, in particular the statistical precision of our R2 estimate. In the output we do not have a specific test for this but we have something which is largely equivalent: the F-test tables.
The F-tables imply rejection of the null hypothesis that there is no regression effect, meaning: all parameters are jointly equal to zero with the possible exception of the intercept.
Notice that, with few observations, even a sizable R2 like the one we found could be entirely due to randomness. The F-test tells us that this does not seem to be the case.
This is not a direct “evaluation” of the statistical precision of our R2 estimate. However, implicitly, since there exists a direct link between the value of the F-test and R2, it tells us that an estimate of R2 like the one we found is very unlikely if there is no regression effect.
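The link between the F statistic and R2 mentioned here can be written explicitly. A sketch (the sample sizes below are hypothetical, chosen only to illustrate the role of n):

```python
# With an intercept, the overall F statistic is a monotone function of R2:
# F = [R2/(k-1)] / [(1-R2)/(n-k)] under the null of no regression effect.
def f_from_r2(r2, n, k):
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

# Hypothetical sample sizes, only to illustrate the role of n
print(f_from_r2(0.45, n=196, k=8))  # large F: R2 = .45 is hard to get by chance
print(f_from_r2(0.45, n=15, k=8))   # same R2, few observations: F below 1
```

The same R2 is strong evidence against the null with many observations and no evidence at all with few: this is the sense in which the F-test indirectly assesses the precision of the R2 estimate.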
From the point of view of forecasting, this is all. We may like the results or not, but this is what we find in the data and, if we suppose some “stability” of the model (see the comments above), this is the precision of the forecast we can make.
What follows can be seen as an “anatomy” of the forecast in terms of each column of X. This can be useful for forecasting but, obviously, it is much more relevant if the setting is such that we are able to hold a causal interpretation of the regression.
If we go to the last column of the regression output (we added this to the standard Excel output) we find the semi partial R squares. We see that only three variables have a sizable marginal contribution to R2 as measured by their semi partial R2: GrnReg, AvePutt and Events. This means that these are the variables whose addition to X most improves the forecast. Can we go a little bit further and say that, barring the Events variable on which we shall comment further, an increase of GrnReg and a decrease of AvePutt are the aspects of the game that, if improved, would imply a greater and more reliable increase in AveWng?
This is a causal interpretation: is it reasonable in our setting? We cannot exclude it, we can only say it is very unlikely to hold.
Why? The data is the summary of a season. It describes a set of “ability” indicators for each player and some other variables.
Let us concentrate on the abilities.
Let us take, for instance, Age. This is a typical variable you cannot intervene on. Notwithstanding this, the variable changes in time. The possible causal interpretation would then be: each year the conditional expected value of gains goes down by almost 600 dollars. Is this the “effect” of age?
Even if we do not consider that what we observe is a cross section of players and not
a time series of results for a single player (so that we may observe the action of Age),
we must answer “beware”. If a causal interpretation was possible, the β of Age times a
change of Age would be the expected change in AveWng if all the other variables are
constant, that is: if only Age acts and the golfer’s abilities as expressed by the other
variables do not change.
Is it reasonable that the natural evolution of Age does not change the other abilities
of any given player?
Quite unlikely. In any case this point should be assessed by theory and empirical data in order for a causal interpretation to be possible (the methods for doing this are not the object of this course).
Let us now consider a variable on which we may think we could “act”: AvePutt.
We cannot arbitrarily set this to a lower number (compare this with “changing interest rates”) but we may conceive of increasing the time dedicated to putting green training. If this reduces the number of putts, even by just 1/100, we should improve (change the conditional expected value of) our winnings “on average” by almost 700 dollars (69000 times 0.01).
Is this the “effect” we can expect? It depends. Golfing is an equilibrium game.
What counts is the overall result and trying to improve a part of the game may have
bad results on other parts of the game.
By training more on the green maybe we worsen (or maybe improve?) our game
under other points of view: length, precision from distance etc.
Moreover: the model was estimated on a sample of players with a given “equilibrium
mix” of abilities.
Is it still going to be valid if we alter such characteristics? Again: we do not
know this and, with no answer, any attempt to use the model in this sense would be
unwarranted.
Notice that here we hinted at three different problems, the same problems we hinted at a number of times above.
The first is that it can be difficult or impossible to act on an Xj and that, at the opposite extreme, some Xj is bound to change by itself.
The second is that it may be difficult to intervene on one Xj without altering other Xj-s and, if this happens, we should model this interaction to have an idea about the “effect” on the dependent variable.
The third is that any action on one or more Xj could alter the conditional expectation itself, and we should model this alteration.
All these problems have been discussed and are still being discussed by econometricians. In fact, as we mentioned, these problems are at the origin of Econometrics and are still its central problem: this is what makes Econometrics a sister, but not a twin sister, of Statistics.
Following the general approach of this section we do not further develop the “causal” discussion and, for a moment, suppose that we can improve our AvePutt by decreasing it without altering the other variables or the regression function itself.
Is it reasonable to assume an improvement of 1/100 if we suppose this does not alter the other indicators? Since we see that AvePutt is correlated with the other variables, this cannot be more than an approximation. However, if this correlation is not too big, it may be that the reasonable values of AvePutt, conditional on the other variables being constant, have a standard deviation which is big enough to allow for “changes” of 1/100 in AvePutt.
The marginal standard deviation of AvePutt is about .023. Notice that this is a
standard deviation across the players, so it does not directly concern our problem.
1/100 is less than one half of the marginal standard deviation of AvePutt. This means that it is quite easy to find different players with such a difference in this statistic. With a little bit of unwarranted logic, let us assume that this is also true for the single player, if we do not condition on his other statistics.
This is the crucial point: both for different players and for the single player we must recall that we are within a regression and that we are evaluating the possibility of changing AvePutt by 1/100 while the other variables do not change.
This means that, as stated above, we must consider the conditional standard deviation, not the marginal one.
Recall the formula

$$\sqrt{\frac{t_j^2(1-R^2)}{n-k}} = \frac{|\hat\beta_j|\sqrt{Var(X_j|X_{-j})}}{\sqrt{Var(Y)}}$$
Using the data in the output we find that the standard deviation of AvePutt (our Xj) conditional on the other variables (X−j) is .021, obviously smaller than the unconditional one but still more than 2 times the hypothesized change of .01.
This implies that, even conditionally on the other variables, different players (and maybe the same player) could easily show such different values of AvePutt.
For the above mentioned reasons, this does not justify, by itself, a causal interpretation. However if such an interpretation were available, an expected effect of the size
of 700 dollars (69000 times .01) or even more would not be unreasonable.
On the other hand an improvement of, say, .04 in AvePutt would probably be unlikely both marginally and, what is more important for us, conditionally on given values of the other X−j.
If this causal analysis is viable, then, we may expect that work on the putting green which does not alter the rest of “the game” could give a golfer a reasonable improvement of 700 dollars in AveWng (roughly 1.5% of AveWng).
Let us now consider other aspects of the estimates.
A possibly puzzling point is given by the signs of AveDrv and DrvAcc, which are both negative.
The semi partial R2 of AveDrv is almost 0 while that of DrvAcc is a little more than 2%.
In most practical contexts we could then avoid discussing the parameter estimates for these variables. As an exercise, however, let us try to use what we know about partial regressions to unravel the puzzle.
Begin by comparing the simple correlations with AveWng and the signs of the parameters estimated in the linear regression. Notice that the sign of the parameter for DrvAcc is the same as that of its correlation with the dependent variable, while the parameter of AveDrv is negative with a positive correlation.
A negative simple correlation between AveWng and DrvAcc may not be surprising and we may try an explanation which, as always in these cases, is implicitly based on some causal interpretation of the parameters.
The possible interpretation is this: it could simply be that, in order to be precise with the drive, a player tends to be too cautious and this may harm his overall result.
There are many alternatives to this interpretation, each depending on some strand of causal reading of the parameters. The choice among these depends on further and more complex analysis and on more structured hypotheses about how the performance of the golfer is connected to each of the statistics. None of this, on the plus side, is needed for a simple forecasting interpretation.
Now let us consider AveDrv: the correlation of this variable with AveWng is positive and not small, while the regression coefficient estimate is negative and the semi partial R2 is virtually 0 (much smaller than the same statistic for DrvAcc, whose correlation with AveWng, in absolute value, was roughly 1/2 of that of AveDrv).
To understand what is happening, here we see the result of the regression of AveDrv on the other columns of the X matrix:
60% of the variance of AveDrv is captured by its regression on the other columns of X. More than one half of this (38% semi partial R2) has to do with its negative dependence on DrvAcc.
Also GrnReg shows a sizable semi partial R2 (14%) and a positive regressive dependence.
As we know, only what is left as the residual of this regression is involved in the estimation of the AveDrv parameter in the original regression. This is the part of the AveDrv variance which is not correlated with DrvAcc and GrnReg (and the other variables in the partial regression).
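This “partial regression” mechanism (the partial regression theorem, also known as the Frisch-Waugh-Lovell theorem) can be verified numerically. A sketch on simulated data (the variables here only play the roles of AveDrv and the other columns; they are not the golf data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins: x_j plays the role of AveDrv, X_rest the other columns
n = 300
X_rest = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
x_j = X_rest @ np.array([0.5, 1.0, -0.8, 0.3]) + rng.normal(scale=0.7, size=n)
X = np.column_stack([X_rest, x_j])
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 1.5]) + rng.normal(size=n)

# (A) coefficient of x_j in the overall regression
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# (B) regress x_j on the other columns, keep the residual, regress y on it
gamma, *_ = np.linalg.lstsq(X_rest, x_j, rcond=None)
u = x_j - X_rest @ gamma            # part of x_j orthogonal to the other columns
beta_partial = (u @ y) / (u @ u)

print(beta_full[-1], beta_partial)  # identical up to rounding
```

Only the residual u, the part of x_j not explained by the other columns, matters for the coefficient: this is why a variable strongly correlated with the others can have a coefficient whose sign differs from that of its simple correlation with y.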
We know that GrnReg is the single most important variable in the overall regression
(in the sense that it shows the highest semi partial R2 ).
Based on this we may attempt an interpretation (again: many are possible): the “equilibrium player” represented by the regression tends to have a higher AveWng if the GrnReg percentage is higher. On the other hand, a higher GrnReg percentage tends to imply a bigger AveDrv. For this reason, marginally, AveDrv is positively correlated with AveWng. However, the AveDrv in excess of what is correlated with GrnReg seems to be harmful to the overall game and, from this, the negative coefficient in the overall model.
Now compute, as we did above, the conditional standard deviation of AveDrv.
According to our formula this is equal to $\sqrt{0.0000812}\cdot 54990/94.76 = 5.23$, to be compared with a marginal standard deviation of 8.27. If we hypothesize a change in this variable (conditional on the other columns of X) equivalent to that hypothesized above for AvePutt (less than 1/2 of its conditional standard deviation), that is equal to 2.5, the overall expected “effect” should be a decrease of AveWng of roughly 200 dollars. You would need a very big change of twice the conditional standard deviation (about 10) to have a negative effect comparable with an AvePutt change of 1/2 conditional standard deviation.
Again a matter of care: these evaluations are borderline causal!
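The conditional standard deviation used above follows from the formula of the previous pages: solving it for √Var(Xj|X−j) gives √(semi partial R2) · sd(Y)/|β̂j|. A sketch with the AveDrv numbers quoted in the text:

```python
import math

# Solve sqrt(tj^2 (1-R2)/(n-k)) = |beta_j| sqrt(Var(Xj|X-j)) / sqrt(Var(Y))
# for the conditional sd of Xj: the left-hand side is the square root of
# the semi partial R2 of Xj.
def cond_sd(semi_partial_r2, sd_y, beta_abs):
    return math.sqrt(semi_partial_r2) * sd_y / beta_abs

# AveDrv numbers quoted in the text: semi partial R2 = 0.0000812,
# sd of AveWng = 54990, |beta| = 94.76
print(cond_sd(0.0000812, 54990, 94.76))  # about 5.23, vs a marginal sd of 8.27
```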
In the end, what would be the most proper use of such a regression?
Suppose you want to bet on how much, on average, a randomly chosen player is going to win. You know the characteristics of the player and you are betting on the results.
The estimated regression would be a nice starting point.
Now change players into stocks, winnings into returns and use market returns, price
to book value, size, and so on as indicators as in the Fama and French model or in the
style analysis model. In which stock would you invest? To which fund manager would
you give your money?
These are clearly relevant questions and the regression model would be fit for them even without any causal interpretation.
9.12.12
Big/small partial R2 and relevance
A last relevant consideration: as we have seen, in order for a variable to “explain” a big chunk of the dependent variable variance it is necessary (though not sufficient) that this variable has some variance left when regressed on the other explanatory variables.
We also said that this is a rather generic statement and that a more precise analysis should be carried out case by case.
Now we must also stress that the analysis developed here takes as “given” the joint distribution of X and, by consequence, the conditional distribution of each column of X given the other columns.
In this setting, an analysis of the “relevance” of a variable in a regression based on the semi partial R2 is well justified. This seems the most relevant case in an observational setting, which is the setting most common in applications in Economics and Finance.
Suppose, on the other hand, that there is the possibility that, keeping constant the regression function E(Y|X) = Xβ, the joint distribution of Y and X may change (perhaps because it is “acted on” by some policy decision or simply because of new circumstances). In this case the overall R2 and each semi partial R2 would, in general, change.
For simplicity, just consider the “univariate” model $y_i = \alpha + \beta x_i + \epsilon_i$ under the hypothesis $E(y_i|x_i) = \alpha + x_i\beta$. In this case the R2 and the semi partial R2 of x are the same and equal to $R^2 = \beta^2 V(x)/(\beta^2 V(x) + V(\epsilon))$.
To fix ideas, suppose β = .5, V(x) = 1 and V(ε) = 10. In this case R2 = .25/10.25 ≈ 0.024. Whatever the interpretation of the regression (forecast or causal), it seems that the role of x, while existing (we know that β is not 0), is not so relevant (at least in terms of a “good fit”).
But suppose that, with no change in the regression, either the variance of x becomes higher or the variance of ε decreases, or both; that is: suppose the joint distribution of y and x changes, for some reason51. For instance, suppose V(x) = 100. If all the rest remains unchanged the new R2 shall be equal to 25/35 ≈ 0.71.
In this case the contribution of x to the quality of the forecast of y becomes considerable.
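The arithmetic of this univariate example is easy to package (illustrative only; the function name is ours):

```python
# R2 of the univariate model y = a + b x + eps:
# R2 = b^2 V(x) / (b^2 V(x) + V(eps)).
def r2_univariate(beta, var_x, var_eps):
    explained = beta ** 2 * var_x
    return explained / (explained + var_eps)

print(r2_univariate(0.5, 1, 10))    # 0.25/10.25, about 0.024
print(r2_univariate(0.5, 100, 10))  # 25/35, about 0.71: same beta, larger V(x)
```

The coefficient β is identical in the two cases: only the distribution of x has changed, and with it the apparent “relevance” of x.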
Is this reasonable, is it relevant? For instance: in an observational setting, can we suppose that the data we use for the forecast are so different w.r.t. those used for the estimation? In a causal setting: is such a big alteration of the behaviour of x possible (and, in a multivariate regression: is such a change possible CONDITIONAL on the other columns of X)?
This, obviously, cannot be assessed in general and can only be evaluated on a case
by case basis.
The important point is, again, to fully understand that even a simple and standard method like linear regression can never be dealt with in a ritual/cookbook way. Only a full understanding of the method and of the circumstances of its specific application can (and does) yield useful results. Barring this, its use can only be understood as a kowtow to pseudo-scientific ritualism or, worse, misleading rhetoric.
Let us consider a case in which a variable may have a “relevant effect” even if it
does NOT explain a big chunk of the dependent variable variance.
Suppose for instance that you have a dataset where observations are the heights of a population of adult men and women. The sample is very unbalanced and it contains, say, 1000 men and 20 women. For this reason most of the observed variance in height shall be due to variance across men. If we regress heights on a constant and a dummy which is equal to 1 if the subject is a woman we shall find, in all likelihood, a statistically significant negative parameter for the dummy (something like -10 centimeters) but an almost zero R2. This does not mean that the difference in height between men and women is irrelevant (it is relevant) but that, due to the fact that most of the sample is made of men, this difference does not explain a big chunk of the variance of THIS sample: most of the variance in this sample is not due to sex, but to variance in height among males.
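A small simulation illustrates the point (all numbers, the group sizes, the means of 178 and 168 cm and the common sd of 7 cm, are hypothetical, chosen only to mimic the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical numbers mimicking the text: 1000 men, 20 women,
# women on average 10 cm shorter, common sd of 7 cm.
men = rng.normal(178.0, 7.0, size=1000)
women = rng.normal(168.0, 7.0, size=20)
height = np.concatenate([men, women])
is_woman = np.concatenate([np.zeros(1000), np.ones(20)])

# Regressing height on a constant and the dummy gives, as slope,
# the difference between the two group means.
beta = height[is_woman == 1].mean() - height[is_woman == 0].mean()
r2 = np.corrcoef(is_woman, height)[0, 1] ** 2

print(beta)  # close to -10: sizable and, in all likelihood, "significant"
print(r2)    # small: the dummy explains little of THIS sample's variance
```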
Now, suppose you apply this result to a balanced sample where 50% of the subjects are women and 50% are men. In this new sample most of the variance shall come from the difference in sex. In other words: we forecast in a setting where the distribution of X is quite different w.r.t. that valid for the estimation sample.
More in general: it may be that the role of a variable, in a forecast or, if reasonable, in causal terms, is “big” while its partial R2 evaluated in the estimation sample is small. If this happens, this is usually due to the fact that, conditional on the other explanatory variables (and maybe even unconditionally), this variable varies very little in the estimation sample and does not determine a relevant part of the dependent variable variance.
51 It is the same as saying that the joint distribution of x and ε changes.
It may be that, for some reason, the observed sample is unbalanced with respect to the population. If, in a more balanced sample, the explanatory variable we are considering is expected to have a higher variance, it may be that its contribution to explaining the variance of the dependent variable increases, so that it becomes interesting to study its behaviour. However, if this is not the case and the sample is representative of the population we are interested in, the “relevant” parameter shall be interesting only if we compare the (few) sample points where the explanatory variable presents very different values.
A second very simple example: suppose you are interested in the expected life of a
sample of patients after a given medical treatment. A small subsample of patients was
using a given drug, say A. A new drug, B, is given to all the subjects in the sample
and you observe a huge variation in mean survival time across different subjects, say a
standard deviation of 10 years over a mean of 5. You also observe that the subsample
previously treated with A has the same standard deviation but a mean of 10. Since this subsample is small, the difference between the means shall contribute very little to the overall variance (the partial R2 shall be small); however, it would be very proper to suggest the use of A jointly with B. Notice that the more you increase the subpopulation using A, the more the explained variance due to use/non use of A shall increase. This, however, is true only up to the point where the fraction of the sample using A is 1/2. If, for instance, everybody uses A, there shall be no “variation” of life span due to use/non use of A, but there shall still be the “effect” of A in the 5 years gained on average by its use.
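The dependence of the explained variance on the treated fraction can be made explicit: a mean shift of d affecting a fraction p of the sample contributes a between-group variance of p(1−p)d², which is zero at p = 0 or p = 1 and maximal at p = 1/2. A sketch:

```python
# Between-group variance contributed by a mean shift d affecting a
# fraction p of the sample: p(1-p) d^2.
def between_var(p, d):
    return p * (1 - p) * d ** 2

d = 5.0  # the 5-year mean gain of the text
for p in (0.02, 0.25, 0.5, 0.75, 1.0):
    print(p, between_var(p, d))  # maximal at p = 0.5, zero at p = 1
```

The “effect” d is the same at every p: what changes with p is only the share of sample variance it accounts for.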
Notice that in this example the reasoning is based on our ability to change the proportion of the population using A. Suppose instead that A is not the use of a medicine but the fact that one’s eyes are one blue and one brown. In this case, observing the same results, we would have very little to suggest, except the fact that B seems a very useful medicine for the few lucky (in this case) people with eyes of different colors.
So: beware of unbalanced samples.
In other settings it may be that we can purposely alter the behaviour of some x not
just in terms of level but also of variance.
The number of possible cases is huge and this is not the place to go further into them.
Some last comments: in the example above we have a case where an irrelevant result in terms of R2 gives us the relevant suggestion that we could assign both drugs A and B to all patients and hence alter the distribution of X. This is a real possibility, we can give both drugs (at least, if their combination is not harmful), and from this comes the relevance of the result. Suppose instead that the difference is in terms of some other characteristic, say: the color of the eyes. In this case we cannot change the percentage of the population with such a characteristic, or “give both colors” to each element of the population. In this case, while interesting, the result is in any case “irrelevant”.
Since all estimates and statistics could be identical in both cases, this implies that
“relevance” is not something that can be fully resolved only on the basis of Statistics:
it requires accurate analysis of the specific problem.
It is also easy to show examples where a big partial R2, while important in forecasting terms and maybe also in “causal” terms, is in practice not directly of any use (beyond forecasting). Suppose we select a population of women of different ages, according to the marginal distribution by age of women, and attribute to each individual the number of children she gave birth to in the last 5 years. It is clear that age shall be relevant (in terms of partial R2) in “explaining” the variance of the dependent variable. This is expected and cannot be of use, as we cannot change the age of the elements in the sample. However, it is going to be important to keep the variable in the regression if we wish to assess the separate effect of other, less obvious but potentially relevant, variables on which we can act (for instance, the amounts of vitamins in the blood of different subjects), when these variables are correlated with age.
Points to remember in reading a regression.
To conclude this section let us summarize the steps in reading regression results.
Before beginning remember: it may not be necessary for you to discuss “effects” of variables. This is really relevant only if you intend to use the model for a “policy” action. If your purpose is data summary or forecasting, “effects” (in the “policy” sense) are not the relevant aspect of regression to be studied.
On the other hand, if “effects” are of interest for you, regression by itself shall not be able to evaluate them and you shall need accessory hypotheses in order to be able to assess them.
Among the possible accessory hypotheses, an experimental setting may sometimes
be useful (if possible).
A “structural” approach is another possibility and this is, from the historical point
of view, the approach chosen by Econometrics.
Neither of these is covered in this introductory course.
This said, here is the list:
1. Divide the analysis: A) known parameters B) statistical estimate. B) is easy (if
you know your Statistics) A) is the tricky part.
2. Understand that your model is a model of a conditional expectation E(Y|X) = Xβ
3. Understand that Y is NOT E(Y|X) but Y = E(Y|X) + ε and Var(Y) = Var(E(Y|X)) + Var(ε)
4. Quantify the part of Var(Y) due to Var(E(Y|X)) and the part due to Var(ε), that is: compute R2
5. You MUST do this because the purpose of a linear OLS model (with intercept)
is that of maximizing R2
6. Moreover, when you discuss the meaning of each βj you are “partitioning” R2 .
7. Understand that the βj of a given Xj can be computed in two ways (partial
regression theorem): A) from the overall regression B) first regressing Xj on the
other columns of X and then regressing Y on the residuals of this regression
8. Hence βj only pertains to the “effect” on E(Y|X) of what in Xj can change conditional on the other X being constant, NOT to the effect of a generic change in Xj.
9. Be careful about the meaning of “effect”: strictly speaking, all you can say is that, if you build forecasts for Y given X using E(Y|X), the said “effect” is that, if it happens that “Nature” gives you two new vectors of observations on X where the difference between the two vectors is just in a different value of only Xj, then the difference between your forecasts is given by the difference in the two values of Xj times the corresponding βj (or its estimate if you are using an estimated conditional expectation). In other words: this tells you nothing, by itself, about the possible change in Y given an act on your part to change some value of Xj. The (by no means easy) study of such a “causal” interpretation has always been very much in the mind of econometricians, who evolved structural Econometrics as an attempt to answer the (very interesting, for obvious policy reasons) question of assessing the possible results of a change in a variable not just given by “Nature” but acted by a policy maker. The obvious difference between the two cases is that the act could not respect the “Natural” joint distribution of observables and make your previous study of this useless as a source of answers. Just think of the obvious difference between observing, say, interest rate changes induced by market dynamics and imposing an interest rate change by policy: the laws concerning the effects of the second act could be completely different from the laws concerning the “natural” evolution of rates in the market. On the contrary, the “forecast change” is always a good interpretation if the observed change happens without interference.
10. Once you understand the meaning of the word “effect”, quantify it, in your sample (that is: for a given joint distribution of Y and X), with the semi partial R2 due to the j-th regressor: $t_j^2(1-R^2)/(n-k)$ (if you are reading a paper and, bad sign, R2 is not available, use the same formula with R2 = 0. This shall give you an overvaluation of the semi partial R2).
11. Evaluate the practical significance of this “effect” (while doing this, ask yourself if the sample is balanced with respect to the explanatory variable and, if not, consider whether a balanced version of it is a sensible possibility: see above the examples involving the heights of males and females and the experiment with two medicines). In general this depends on the specific case. However, a first rough idea can be gained by computing the change in the conditional expectation of Y induced by a “reasonable” change of Xj. Since this must be a “reasonable” change “which leaves the other explanatory variables unchanged” (recall the meaning of βj induced by the partial regression theorem), it could be measured by the conditional standard deviation of Xj given the other regressors. It could be useful, then, to compute the ratio $|\beta_j|\sqrt{Var(X_j|X_{-j})}/\sqrt{Var(Y)}$, which expresses this “effect” in units of the standard deviation of Y (the modulus around βj comes from the fact that we get the formula by taking the square root of a square). A quick proof shows that this is exactly identical to the square root of the semi partial R2 for Xj, and this confirms the centrality of this quantity. Beware! Do not be deceived by a quantity which bears some resemblance to this. This is $|\beta_j|\sqrt{Var(X_j)}/\sqrt{Var(Y)}$, which shall obviously be bigger (actually: not smaller) and so, maybe, more gratifying. The point is that, by not conditioning the variance of Xj, it violates the interpretation of the coefficient. By the way: it may well happen that this quantity is bigger than 1 which, obviously, is absurd52.
12. Then do Statistics (namely: consider that you must estimate β and evaluate the
quality of the estimate).
13. Remember: an estimate is “statistically significant” if the ratio of its value to its sampling standard deviation is big enough to say that you can reliably distinguish it from zero.
52 Most of the times, this choice is made when the correct measure of relevance would give as a result a very small value, that is: the practical irrelevance of the “effect”. In this case the use of the unconditional standard deviation inflates the result but, since the starting point is very small, the inflated value is smaller than 1 and, apparently, you do not get absurd results. The inconsistency is in any case evident: for instance, you get a semi partial R2 of, say, .0001 for a given Xj and then you find written in the paper that “a change of Xj equal to its standard deviation implies a change of (the conditional expected value of - but sometimes this too is forgotten) Y equal to 1/2 of its standard deviation”. These two pieces of information are evidently conflicting and the solution is that, since the “effect” measure given by βj only has to do with “a change in Xj with the rest of the regressors constant”, you cannot use the unconditional standard deviation of Xj as a measure of a “normal” change in Xj (it could be, but in an unconditional setting): you must use the conditional standard deviation. Just take the square root of the semi partial R2 and you get that the correct measure of the change in the conditional expectation of Y, given a “reasonable conditional on X−j” change of Xj measured by its conditional standard deviation, is, in units of the standard deviation of Y, given by .01. A completely different picture. What is happening: just take the ratio of this with the previous number, .01/(1/2)=.02: this shall be the ratio between the conditional and the unconditional standard deviation of Xj. It happens that today’s common use of very big samples, which can be used only by adding a huge number of “fixed effects”, makes this event quite common.
14. Remember: a “statistically significant” estimate could well be practically irrelevant if it corresponds to a small semi partial R2. Moreover: with enough data you can get any precision you like out of an estimate and distinguish the estimate from zero even if its value is almost exactly zero. If you just play with the sample size you shall see that the semi partial R2 is very little affected by the sample size when this changes from good, to huge and maybe to amazing. However, you are free to use the above suggested (correct) statistical measure of practical relevance but... what is “small” and “big” in terms of practical relevance depends on the specific purpose of the analysis, not on Statistics. In a good empirical analysis the researcher should pre-specify which size of an effect is practically relevant and which precision is required for its estimate. This would allow the researcher to choose (when practically possible) a sample size big enough to give estimates of the parameters precise enough to assess the size of the effect to the required precision (footnote 53).
15. In any case remember the difference between significance and relevance. E.g.: beware the use of large datasets. If, in the comments to their results, Authors using large datasets stick too much to “statistical significance” and do not deal with practical relevance, most likely their results can be summarized as “a very precise estimate of irrelevant effects”, so that the reading of the main results of the paper can usually be changed into something like: “our data point strongly to the irrelevance of the effect under study”. By the way: while not currently fashionable, such a finding could be of great interest.
16. Finally: Beware of unbalanced samples (this is the same as 11 but it is very
important, so I repeat).
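Points 13 and 14 above can be illustrated with a small simulation. The sketch below (hypothetical numbers, assuming numpy is available) regresses y on a single regressor whose true coefficient is tiny: as n grows, the T-ratio of the estimate explodes while the regressor's contribution to the R2 stays negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_t_and_r2(n, beta=0.02):
    """Regress y = beta*x + noise on x alone; return the t-ratio of the
    slope estimate and the R^2 (here also the semi partial R^2 of x)."""
    x = rng.standard_normal(n)
    y = beta * x + rng.standard_normal(n)
    bhat = np.sum(x * y) / np.sum(x * x)      # OLS slope (no intercept)
    resid = y - bhat * x
    s2 = np.sum(resid**2) / (n - 1)           # residual variance estimate
    se = np.sqrt(s2 / np.sum(x * x))          # standard error of bhat
    r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return bhat / se, r2

t_small, r2_small = ols_t_and_r2(1_000)
t_big, r2_big = ols_t_and_r2(1_000_000)
# The semi partial R^2 barely moves with n, while the t-ratio explodes:
print(t_small, r2_small)
print(t_big, r2_big)
```

With a million observations the tiny coefficient is “statistically significant” by any standard, yet the fraction of variance it explains is of the order of .0004: precision, not relevance.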
9.12.13
Envoy: an example of important points we left out of our analysis.
More on coefficient interpretation: are “changes in X” reasonable?
We have seen that, from the statistical point of view, it may be difficult to “change Xj independently of the other X-s” because the columns of X could be strongly correlated, so that, once we “keep the other X-s constant”, Xj may change by only a very limited amount.
Let us see this from another point of view.
Footnote 53: In Finance we have a very good example of this. Factor models break down the overall variance of a return into components correlated with “factors”. The most relevant one is the “market level” but, over time, this has been supplemented by other factors like “value”, “size”, “momentum” etc. These new factors “explain” a very tiny fraction of the variance of returns, the more so if compared to the market factor. However their study is not irrelevant, because you could in practice invest in a portfolio whose only (systematic) change in returns is correlated with just the chosen “factor”. In other words: “even a small variance is relevant if you can isolate it from the overall variance”.
Each observation on the basis of which we estimated the linear model had a given vector of values for the various Xj. If we change one of these values by any amount, in all probability we shall define a vector which does not correspond to ANY observed data point. Is this new vector a possible combination of values for X?
The question is by no means irrelevant.
Suppose you are studying, say, the relationship between the price per kilogram of several cakes and the percentages of different ingredients. You may find that a change of, say, 1 percentage point in the percentage of a given ingredient, say flour, “keeping the others constant” (and here it is a good exercise to understand the meaning of this: since the sum of all percentages is 1, it is difficult to change one and keep the others constant) increases the price per kilogram by, say, one dollar.
Now, the question is: if I increase the percentage of flour by 1 point I increase the price by one dollar, and maybe by 2 for 2 points, but... wait a moment... any cook knows that such a change is not going to give you a more costly cake, but no cake at all! Maybe you can do something similar for very small changes in the ingredients; however, in this case the result is not going to be so interesting.
The question is: does the new combination of values for “X” I am hypothesizing still correspond to a cake?
In a sense each recipe is an equilibrium point: the right combination of ingredients (and not only this) yields a good cake. Am I sure that, by modifying one or more ingredients, I still get a cake and not a mess?
Let us take a small step further and be more like Economists.
We are interested in production functions and we are regressing, say, the log of the
output (in some unit, which?) of different production plants onto the logs of different
inputs.
Again, supposing we understand the statistical meaning of the estimated coefficients: if we apply such coefficients to a different set of inputs, the model gives us a “forecasted” value of output. However, before discussing this value: is the combination we are considering a viable combination, an equilibrium combination? We know a priori, by observation, that those combinations of input values which are in X are viable: they PRODUCED an output! But is this still true of the new one?
A last step and then we conclude.
In many macro models, data are for several variables in different countries and, sometimes, in different periods of time. Each observed data point corresponds to some “equilibrium” state as, after all, the data WAS observed. Is this true for ANY combination of data we may be interested in using for forecasts, or should we, a priori, be able to show that such a combination IS viable?
In recent years it has become popular to regress some economic variable, e.g. the
growth rate of GNP or the change in size of the debt or similar, on a cross section
of countries with data describing some sociopolitical characteristic of such countries
(coming as a rule from large scale surveys). Forget for a moment the fact that such
datasets could be heterogeneous and could violate basic OLS hypotheses. Suppose you
get an estimate of the OLS coefficients.
How can you read such estimates; what is the use of the results? Do you believe that you can change one or more of the characteristics of a country and still get a new viable country which satisfies the regression equation and for which the forecasted value of the dependent variable is a possible output? Provided that you really may change such characteristics, the likely result is that you destroy an equilibrium and, if a new equilibrium is reached, it shall be completely different from the old one.
So beware: a linear regression model is useful to describe the “output” corresponding to different sets of “inputs”, but any reading of its results in terms of “change of Y due to change of X” is strongly dependent on the hypothesis that any “test” combination of X-s you are interested in is a “viable” or “equilibrium” combination.
It should be clear enough that to assess the validity of such a proposition is by far
more difficult than to simply estimate the model.
9.12.14
Further readings (Absolutely Not Required for the exam)
Probably the best book on linear regression, from the point of view of interpreting results, with strong, detailed statements about the difference between forecasting and causal analysis, lots of examples and insight, and with a minimum of Mathematics (probably because it was written by very good mathematicians), is:
Mosteller, F. and Tukey, J. W. (1977). “Data Analysis and Regression: A Second Course in Statistics”. Addison-Wesley, Reading, MA. In particular, see ch. 13, with the meaningful title: “Woes of Regression Coefficients”.
A good and more concise summary can be found in:
Sanford Weisberg (2014) “Applied Linear Regression”, III ed., Wiley. In particular,
see Ch. 4.
A short paper by a great statistician which contains, in simple and condensed form,
most of what was discussed here, is: George E. P. Box (1966) “Use and Abuse of
Regression”, Technometrics, Vol. 8, No. 4 (Nov., 1966), pp. 625-629
For the maths of semi partial R2 joint with a keen discussion of “effect sizes” you
may see:
Jacob Cohen et al. (2013) “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences”, III ed., Routledge.
For those interested in reading something more about the different interpretations of a linear model (e.g. forecast VS causal), which make it, arguably, a very tricky and slippery field to walk on, the following books could be useful:
J. D. Angrist and J. S. Pischke (2009) “Mostly Harmless Econometrics”, Princeton University Press.
J. Pearl et al. (2016) “Causal Inference in Statistics: a Primer”, Wiley.
9.12.15
Examples from the literature: interpreting VS advertising a regression
We report here three examples from papers published in main journals, which could be useful as examples of both correct and incorrect readings of simple regressions.
These are not Finance papers, as I had to limit myself to standard OLS examples and these are not so frequent in the recent literature (regression models are used a lot but, due to the specific settings, the estimation method is not simple OLS, so that the above results are no longer STRICTLY valid).
There is an obvious sample selection bias in this choice: I chose papers where the readings of regression results, while rather standard, left many points to be discussed.
The comments below are, and are to be intended as, limited to the reading of
specific regression results and do not extend to the full paper. It is frequently the case
that a paper contains very interesting ideas even when such ideas are, in the specific
instance, really not upheld by the empirical analysis presented in the paper itself.
It is not a joke to say that, at least in Economics and Finance, it can be very difficult to find good empirical grounds for assumptions so reasonable as to be “necessarily” true. Sometimes this would induce even the best researcher into doubting the data and not the hypothesis.
In practice, sometimes this induces even the best researcher into a “creative” use of Statistics, so as to make “statistically relevant” what SHOULD, a priori, be relevant.
The first example is drawn from “Distributive Politics and Economic Growth”, Alberto Alesina and Dani Rodrik, The Quarterly Journal of Economics, Vol. 109, No. 2,
(May, 1994), pp. 465-490.
The regressions we consider have as dependent variable the average per capita
growth between 1960 and 1985.
The purpose of the models is to measure the dependence of growth on the initial value of the Gini index.
The explanatory variables are: per capita GDP, Primary school enrollment ratio,
Gini index for the concentration of the income distribution and Gini index for the
concentration of land ownership distribution. A dummy (0-1) variable for democracy
is included and the product of this times the Gini/land coefficient is considered.
Regressions are run on several subsections of a sample, but we do not comment on this; moreover, we only consider OLS estimates.
Quoting from the paper:
“The results indicate that income inequality is negatively correlated with subsequent
growth. When either one of the two Gini’s is entered on its own, the relevant coefficient
is almost uniformly statistically significant at the 5 percent level or better and has the
expected (negative) sign. The only exception is the OLS regression for the large sample
(column (3)), where the income Gini is statistically significant only at the 10 percent
level. We also note that the t-statistics for the land Gini are remarkably high (above
4), as are the R2 ’s for the regressions that include the land Gini’s. When the land
and income Gini’s are entered together, the former remains significant at the 1 percent
level, while the latter is significant only at the 10 percent level (the sample size shrinks
to 41 countries in this case, since many countries have only one of the two indicators).
The estimated coefficients imply that an increase in, say, the land Gini coefficient by
one standard deviation (an increase of 0.16 in the Gini index) would lead to a reduction
in growth of 0.8 percentage points per year.”
Let us comment on this.
“Income inequality is negatively correlated with subsequent growth”: we (as the Authors) see that, in the regressions with either the income Gini or the land Gini coefficient, but not both, the T-ratio of the included variable’s estimate is “statistically significant”, with a P-value smaller than .05. However, when both variables are included together (columns (7) and (8)), only the land Gini coefficient is significant.
Using the partial regression theorem, we can say that the two Gini coefficient series are correlated, but the coefficient which has an “effect” (in forecasting terms) on growth is NOT the Gini coefficient of income, but the Gini coefficient of land distribution. If we drop the land distribution Gini, the income distribution Gini becomes relevant as a proxy (due to the correlation) for the land distribution Gini.
So, variation in income distribution, when not correlated with variation in land distribution, has no relevant correlation with growth.
Hence the above quoted paragraph should begin with: “The results indicate that land distribution inequality, and not income inequality per se, is negatively correlated with subsequent growth” or, better, “The results indicate that land distribution inequality is negatively correlated with subsequent growth. Income inequality is not related to subsequent economic growth EXCEPT in that part which is explained by unequal land distribution” (for this reason: when both variables are in the model, the land distribution Gini prevails). In fact, this could add interest to the Authors’ conclusions.
Now about the “size” of the effect (see the general discussion above).
“The estimated coefficients imply that an increase in, say, the land Gini coefficient
by one standard deviation (an increase of 0.16 in the Gini index) would lead to a
reduction in growth of 0.8 percentage points per year.”.
This is a (rather frequent) incorrect reading of the meaning of a regression coefficient. We already commented on this point. It is incorrect for several reasons.
The regression has to do with conditional expected values of the dependent variable, not with observed values. The two would be similar if the regression R2 were near one, which is not the case here. In general the “full” change in Y is the change in the conditional expectation PLUS a random error (which, in this example, and if we suppose we are in model (5), has about the same variance as the conditional expectation, since the R2 is of the order of .53).
In short: we may speak of “reduction of the conditional expectation of growth” not
of “reduction in growth”.
Each coefficient has only to do with the “effect” (again: this has only to do with a change of forecast, not with any causal interpretation; we have glossed so much on this word that we should be able to use it here as a shorthand while avoiding wrong ideas) on the conditional expectation of the dependent variable of a change of, in this case, the Gini land coefficient, “keeping all the other variables constant”.
If we want to express the effect with respect to nσ changes in Xj, we must consider the conditional σ for Xj GIVEN the other columns of the X matrix. In the above example: we do not know (data are not provided) the correlation of the Gini land index with the other independent variables, but it is quite clear that this is not zero. If we suppose the estimate to come from model (5), the estimated parameter value is -5.50 and the T-ratio -5.24. Computing the semi partial R2 with n = 46, k = 4 and an overall R2 = .53, we have that the amount of the overall R2 due to the Gini land variable is 5.24^2 · (1 − .53)/42 ≈ .3.
The square root of .3 is about .55. According to what we know about the semi partial R2, this means that a change of one conditional sigma of the Gini land index times the estimated parameter, divided by the standard deviation of the dependent variable, is equal to .55. In order to get results in terms of values of the dependent variable we would need data on the standard deviation of this, which are not available.
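This arithmetic can be replicated in a few lines. The sketch below is an illustration in Python; the helper function simply implements the formula used in the computation above, semi partial R2 = t^2(1 − R2)/(n − k), with the numbers reported in the text:

```python
import math

def semi_partial_r2(t, n, k, r2):
    # Contribution of one regressor to the overall R^2, recovered from
    # its t-ratio: t^2 * (1 - R^2) / (n - k), as in the text above.
    return t**2 * (1.0 - r2) / (n - k)

# Numbers reported for model (5): t = -5.24, n = 46, k = 4, overall R^2 = .53
sp = semi_partial_r2(t=-5.24, n=46, k=4, r2=0.53)
print(round(sp, 2))             # roughly .3
print(round(math.sqrt(sp), 2))  # roughly .55: effect size in units of sd(Y)
```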
A final question: even if we accept the interpretation of the paper, does this mean that if we act to reduce land distribution inequality we should get an increase in growth?
Notice that the paper does not explicitly say that by “increase” it intends some change due to policy actions, revolutions, or any act which alters the “equilibrium” expressed by the dataset.
In the conclusions of the paper, however, this seems to be the idea of the Authors.
In fact their idea seems to be that what is observed is not an equilibrium at all:
“The basic message of our model is that there will be a strong demand for redistribution in societies where a large section of the population does not have access to
the productive resources of the economy. Such conflict over distribution will generally
harm growth. Our empirical results are supportive of these hypotheses: they indicate
that inequality in income and land distribution is negatively associated with subsequent
growth”.
Whatever the connection of such a statement with the empirical analysis presented in the paper, it may be useful to recall that the paper presents an observational study. Hence, the reading of the results made in the paper is compatible, under some stationarity hypothesis, with a forecasting use: we are measuring how much forecasts of growth differ for countries with different Gini coefficients of, say, land, and identical values of the other variables. Doing this we just observe; we, politicians or revolutionary leaders, do not act by forcing variables to specified values, and we do not know the result of such actions.
The study of the implications of an action toward reduction of inequality, whatever
the origin of such action, would require a “causal” approach which is not developed in
the paper.
The causal approach, given the impossibility of experiments, should be based on a
structural model to specify in detail which shape the policy toward changing income concentration (or, better, land concentration) would take and, for instance, to clarify under which conditions such a policy would keep unaltered both “the other variables” and the overall structure of the conditional expectation.
Alternatively, the structural model should specify in which ways the policy action
would change these.
Any causal interpretation is simply without grounds if these conditions are not
satisfied.
We pass now to a second paper: “Fairness and Redistribution”, Alberto Alesina
and George-Marios Angeletos, The American Economic Review, Vol. 95, No. 4 (Sep.,
2005), pp. 960-980.
Here we see an example of a rather anomalous model.
The dependent variable is bounded; this by itself may create problems for the validity of OLS hypotheses (footnote 54), but we do not comment on this.
Instead, we choose this example because it is a clear instance of the “significance through sample size” + “statistically significant means relevant” pitfall we mentioned above.
The idea is as follows.
Under OLS hypotheses, the variance of the estimates is roughly proportional to 1/n, where n is the number of observations. This means that even very small parameters, of no practical consequence, can be estimated with such precision as to be distinguishable from 0. However, “statistically significant” actually means only this, roughly: “the parameter estimate has a sampling variance small enough that we are able to say the parameter is not exactly zero”. In other words, very small parameters can be distinguished from 0 if n is big: they are “statistically significant”, their T-ratios are big enough to reject a null hypothesis of zero for any sensible size of error of the first kind.
As explained above, this is not sufficient to say that such parameters have a size
which is relevant in economic sense or any other sense. Readers of standard books of
Statistics are frequently warned about this point, but the confusion between “statistical
significance” and “relevance” is still there to be observed in applied research across very
different fields (for this exact reason books on basic Statistics, and these handouts, still
warn you about this problem).
In a typical case with “big n”, but not very relevant parameters, we have regressions where most parameters are “statistically significant” while the overall R2, and/or the semi partial R2 of the parameters of interest, is very small. In cases like this, the correct overall interpretation is: “OK: I have a statistically very stable estimate of a very small parameter. This implies the irrelevance of the corresponding variable in the regression, at least in the sense that it contributes very little to the variance of the dependent variable (footnote 55). In this sense I can say that the overall effect of the variable (in forecast terms, without further causal analysis) is well estimated to be negligible”.
In other words, a simple look at the R2 (or semi partial R2) value, joined with the fact that the number of observations is big so that, if the OLS hypotheses are valid, the estimates of the parameters (and of the R2) are statistically very stable, should prevent any further analysis of the model except an analysis directed to establish why variables
that, a priori, the researcher considered relevant, seem to have, in the data, a negligible effect.

Footnote 54: Both in the weak and strong version. Since the dependent variable is between 0 and 1, the forecast plus the error must be bounded between 0 and 1. This implies a (non-linear) dependence between forecast and error, as a forecast near, say, one is compatible only with a small positive error and a possibly bigger negative error. Moreover, errors cannot be Gaussian, as the support of this probability distribution is unbounded. More specific models (Logit, Probit, etc.) exist for this kind of data; however, the use of a linear regression, called the “linear probability model”, can still be a useful first approximation.

Footnote 55: We discussed the fact that it may sometimes be the case that some variable which contributes little to the R2 is “relevant” in some other sense. Moreover, sometimes the discovery that some variable does not contribute to the R2 could be quite interesting.
In fact, the relevant (and it IS relevant) information we can derive from the model is that the effect of any of the explanatory variables on the dependent variable is substantially zero. Notice that this could also imply a problem in the design of the empirical analysis, and is quite useful information.
Notwithstanding the usefulness of such results, sometimes (and in some fields) both researchers and journals consider such results unsatisfactory. This may be the reason why, for instance in this particular case, we read:
“As Table 2 shows, we find the belief that luck determines income has a strong and significant effect on the probability of being leftist.”
Notice the term “effect”. As in the case of the previous paper, this could be implying, or not, a causal analysis, which is not developed in the paper.
The doubts on the effective purpose of the paper in this respect are the same as those raised before, so we do not discuss them any further.
Here, we shall intend the term as “ability to forecast” and not as anything to do with the possibility of intervention.
Now to the “strong and significant” clause. This is what the Authors write. Let us see how this idea, which is in contrast with the reported results, could arise (the contrast is not with the word “significant”, if this means “statistically significant”, but with the word “strong” if it is taken with any sensible meaning implying any relevant “explanatory” power of the regressors).
The dependent variable is a 0 to 1 index related to the answer to the question: "In political matters, people talk of left and right. How would you place your views on this scale, generally speaking?".
The variable of interest for the Authors is the “individual belief that Luck determines income” (which, I think, is again a 0 to 1 variable). The corresponding estimated coefficient, depending on the model, is .54 or .607 (but you should consider the second, why?); the corresponding T-ratio is 3.88 (at least, we suppose that this is the T-ratio as stated by the Authors; in fact the T-ratio should have the same sign as the parameter estimate, while the reported ratios are all positive).
On the basis of this info, the statement of a “strong and significant effect” of the belief about luck is unwarranted. In fact the Authors include both the model with and the model without the variable of interest, and the overall R2 changes by only .01. This would be a direct estimate of the semi partial R2; we must take into account, however, that the samples are not the same for the three regressions.
Using our Corollary with n = 14998 and k = 16, we have that the contribution of the “luck” variable to the overall R2 is almost exactly .001.
The square root of this is roughly 3/100, so that the expected change in the conditional expected value of the dependent variable, due to a change (“keeping the rest constant”) of one conditional standard deviation of the explanatory variable, is of the order of 3% of the standard deviation of the dependent variable.
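Again a quick check of the arithmetic (a sketch; since the overall R2 of these regressions is not reported here, the factor (1 − R2) is simply bounded by 1, giving a slight over-estimate):

```python
import math

# t = 3.88, n = 14998, k = 16 as reported in the text; with (1 - R^2)
# bounded by 1, the Corollary t^2 * (1 - R^2) / (n - k) gives:
sp = 3.88**2 / (14998 - 16)
print(round(sp, 4))             # about .001
print(round(math.sqrt(sp), 3))  # about .03, i.e. 3% of sd(Y)
```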
While statistically significant (a T-ratio above the 5% level), the effect is in fact negligible or, in better words, it is well estimated to be of negligible size.
Suppose that the “luck” variable is itself between 0 and 1, consider the extreme values, and even suppose there is zero correlation between the “luck” variable and the other variables. The difference of the conditional expectation in the extreme cases is .607, which on a 0-1 scale may seem big. However, what you observe is not the expected value of the dependent variable but this plus the error, and only 1/1000 of the variance of expected value plus error is due to the (extreme) change in the explanatory variable, hence to a change in the expected value.
Sure, it is easier to win betting on heads with a coin where the probability of heads is .501 rather than .500; however, I would not say that I have a “significant and strongly higher probability” of winning if I bet on heads (even if such a very small difference in probability can be very well estimated, and so be statistically significant, if the number of observations is very big).
What can be said is that n is so big that even small differences between the probabilities of heads and tails can be estimated with statistical reliability: this is the only meaning of “statistical significance” (footnote 56).
Our last example is drawn from “Does Culture Affect Economic Outcomes?”; Luigi
Guiso, Paola Sapienza, Luigi Zingales, The Journal of Economic Perspectives, Vol. 20,
No. 2 (Spring, 2006), pp. 23-48.
Footnote 56: At this point in the Handouts the Reader should have understood a leitmotiv of our presentation: a good user of Statistics, before even observing data, should have a clear idea about which “size” of “effects” can be distinguished on the basis of the available data, and about the implied ability of the data to yield relevant information.
The dependent variable here is a 0, 1 (not 0-1: only 0 and 1) variable, where 1 means that the respondent is self-employed and 0 that he or she is employed but not self-employed. As in the previous case there are problems in using linear regressions here, but we do not discuss this.
The explanatory variable is “Trust” and it is a dummy variable equal to 1 if there is
a positive answer to a question related to “trust in others”. Other variables are added
as “controls”.
Again, while the purpose of the paper is not clear, here by “effect” we do not intend the effect of an action but just a measure related to forecasting.
Probably, the Authors have in mind a causal effect. In fact the Authors use a method (instrumental variables) which we do not discuss in this introductory course and which tries to estimate the regression coefficient of a variable in a regression with many variables, some of which are not observed. A common mistake is to consider this as equivalent to the measure of a “causal” effect. This is wrong but, given both the introductory level of this course and the information contained in the paper, we cannot go further into this topic.
The estimate to be considered is the one corresponding to the second model (as usual, if estimates change when adding variables, it is better to consider only the model with the greater number of variables). The value is .0167 and the standard deviation .0046 (this is an estimate derived with a formula somewhat different from the OLS one, but this does not change our interpretation of the result).
We do not have the overall R2; however, we can use our Corollary in order to estimate the amount of R2 due to the Trust variable. With n = 22791 and k = 17 we have that this amount is, at most, 0.0005.
The Authors comment is: “As Table 1 reports, trust has a positive and statistically
significant impact on the probability of becoming an entrepreneur in an ordinary least
squares regression (the probit results are very similar). Trusting others increases the
probability of being self-employed by 1.3 percentage points (14 percent of the sample
mean)”.
This sentence (again: in a forecast sense) IS (partially) correct, as the impact IS positive and statistically significant. Moreover, using the term “probability”, the Authors are clearly considering the expected value of the dependent variable (they should use the term “conditional probability”, but this would perhaps be too pedantic).
If we recall that the square root of the semi partial R2 is equal to:

    sqrt(Var(Xj|X−j)) · |β̂j| / sqrt(Var(Y))
and compute this, we get a value of the order of 2.2%. Since the dependent variable’s empirical variance is 0.092(1 − 0.092) (1.3 is 14% of the sample mean, and the sample mean is the relative frequency of ones), a change of the square root of this of the order of 2.2% is equivalent to a change in the frequency of self-employed of less than 0.005 (a bit improperly, you may think of this as the change of roughly 110 units in the sample of 22791 units from not self-employed to self-employed).
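These numbers too can be checked in a few lines (a sketch; the (1 − R2) factor is again bounded by 1, so both the semi partial R2 and the implied change are upper bounds, of the same small order as the figures in the text):

```python
import math

# Reported numbers from the Trust regression:
# estimate .0167, standard error .0046, n = 22791, k = 17.
t = 0.0167 / 0.0046                 # about 3.6
sp_bound = t**2 / (22791 - 17)      # Corollary with (1 - R^2) bounded by 1
# Dependent variable is 0/1 with sample mean .092, hence Bernoulli variance:
var_y = 0.092 * (1 - 0.092)
# Change implied by one conditional sd of Trust, in units of the dependent
# variable (an upper bound, well below one percentage point):
effect = math.sqrt(sp_bound) * math.sqrt(var_y)
print(round(sp_bound, 4), round(effect, 3))
```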
The Authors are not claiming that the effect of Trust is of any practical relevance; however, they do not point out, on the contrary, that the effect is arguably WITHOUT ANY practical consequence.
It is clear that, in this example, the sample is so big that, if we suppose standard
iid hypotheses are valid, very small “effects” can be measured with precision, hence be
statistically significant, but such precision in estimating small effects does not make
them relevant.
The purpose of the paper is subsumed in the following sentence: “Having shown
that culture as defined by religion and ethnicity affects beliefs about trust, we now
want to show that these beliefs have an impact on economic outcomes”.
The empirical results of the paper do not seem to clearly point to this conclusion.
This is not to affirm that “culture has no effect on economic outcomes”; it is very likely that such effects exist.
The problem is how to measure these effects, as defined by the Authors, with the
available data. The correct reading of the empirical results of the paper is that, under
the definition and with the data of the paper, such effects are measured as substantially
negligible.
Since we can a priori argue for the existence of such effects, the point is now to find, if possible, a proper empirical measure/definition of “culture” and proper data such that those effects can actually be estimated. Obviously, it could be the case that, with a proper definition of “culture”, a simple measurement of its “effects” based on a linear regression model shall be seen as inappropriate.
Culture is a very rich construct; it is likely difficult to reduce it, in a meaningful way, to one or more qualitative or cardinal variates. Even when this is possible, why should its “effects” be expressed and measurable as monotonic, even linear, contributions to the conditional expectation of some variable?
Any analysis of "cultural effects on economics" like the one contained in
the quoted paper should begin by suggesting a solution to these practical modeling
problems, on penalty of irrelevance. In this particular case, in fact, a correct
reading of the results strongly suggests the irrelevance not of culture for economics
but of this way of measuring it.
9.12.16 Concluding summary
As an overall comment to these examples, I would like to stress the need for any user
of economic or financial research, and more in general of empirical research, to provide
him/herself with the basic tools for "filtering out" excusable rhetorical noise from content
when reading other people's papers.
It is absolutely understandable, if maybe a little scary from the point of view of a
layman's idea of science, for Authors to "sell" a paper and try to put their results in
the best possible light.
This is true in all fields.
However, empirical research has a role and a consequence only when both researchers and readers “share the code” which allows them to separate effective content
from (admissible) advertising.
The knowledgeable reader shall understand that, when we consider the selection, in
the universe of papers, of those actually published in main journals, the hypothesis that
"paper salesmanship" counts in having a paper accepted implies that many examples
similar to those summarized here shall be found in main journals.
Clearly, it is, hopefully, far less likely that interesting results shall be rejected for
lack of salesmanship. In fact, it should be easy to sell really interesting results: these
should "sell themselves".
Such a selection effect must be taken into account while reviewing any strand of
empirical literature.
There exists a subfield of Statistics called "meta analysis" that deals with these
problems. Interestingly, meta analytic studies of the literature are quite frequent in
medicine and biology but not, until recently, in Economics or Finance.
Examples
Exercise 9-Linear Regression.xls
10 Style analysis
Style analysis is interesting both from the point of view of practitioner’s finance and
as an application of the linear regression model.
The current version of the model was elaborated by William F. Sharpe in a series
of papers beginning in 1989. In this summary we shall refer to the 1992 paper (as of
November 2018 you may download it at http://www.stanford.edu/~wfsharpe/art/sa/sa.htm).
In order to understand the origin of the model we must recall the intense debate
developing during the eighties about the validity of the CAPM model, its possible
substitution with a multifactor model and the evaluation of the performance of fund
managers.
In a nutshell (we come back to this in some more detail in the next chapter): a factor
model is a tool for connecting expected returns of securities or security portfolios to
the exposure of these securities to non diversifiable risk factors. The CAPM
asserts that a single risk factor, the "market", or, better, the random change in the
"wealth" of all agents invested in the market, is priced in terms of a (possible) excess
expected return. This factor is empirically represented by the market portfolio, that
is: the sum of all traded securities. The expected return of a security in excess of the
risk free rate (remember that we are considering single period models) is proportional
to the covariance between the security return and the market factor. The
proportionality factor is the same for all securities and is called the price of risk.
Multifactor models, such as the APT, suggest the existence of multiple risk factors
(not necessarily traded) with different prices of risk, so that the cross section of expected security (or security portfolio) excess returns is "explained" by the set of the
security exposures to each factor. Classical implementations of the APT were based
on economic factors, some tradable, like the slope of the term structure of interest rates, some, at least at the time, non tradable, such as GNP growth and inflation. At
the turn of the nineties Fama and French, followed by others, produced a number of
papers where factors were represented by spread portfolios. The most frequently used
factors were based on the price to book value ratio, on the size of the firm and on some
measure of market "momentum" (relative recent gain or loss of the stock w.r.t. the
market). For instance: the price to book value factor was represented, in empirical
analysis, by the p&l of a portfolio invested, at time zero, in a zero net value position,
long in a set of high price to book value stocks and short in a set of low price to book
value stocks. Fama and French asserted that the betas w.r.t. this kind of factor
mimicking portfolios were "priced by the market", that is, the correlation of a stock
return with such portfolios implied a non null risk premium.
Consider now the problem of evaluating the performance of a fund manager. A
preliminary problem is to understand for which reason you, the fund subscriber, should
pay the fund manager. Obviously, you should not pay the fund manager beyond
implementation costs (administrative, market transactions etc) for any strategy which
is known to you at the moment you subscribe to (or do not withdraw from) the fund
if this strategy gives “normal” returns and if you can implement it by yourself.
Suppose, for instance, that the asset allocation of the fund manager is known to
you before subscribing the fund. Since the subscription of the fund is your choice
the fund manager should not be paid for the fund results due to asset allocation, or,
better, should not be paid for this beyond implementation costs. A bigger fee could be
justified only if, by the implementation of management decisions you cannot forecast
on the basis of what you know, the fund manager earns some “non normal” return.
This is the reason why index funds should (and, in markets populated by knowledgeable investors, usually do) ask for small management fees. What we say here is
that the same should hold for any fund managed with some, say, algorithm replicable on the basis of a style model: for instance, funds which follow asset selection
procedures based on variants of the Fama and French approach (that is: stock picking
based on observable characteristics of the firms issuing the equity such as, for instance,
accounting ratios, momentum etc.). While implementing such models requires some care
and a lot of good data management, the reader should be aware of the fact that nothing
magic or secret is required for the implementation of these algorithms.
The fund manager contribution, with a possible value for you, if any, should be
something you cannot replicate, that is: either something arising from (unavailable
to you) abilities or information of the manager or, maybe, from some monopolistic or
oligopolistic situation involving the manager. Let us suppose (a very naive idea!) that
the second hypothesis is not relevant. A formal way to say that the manager ability
is not available to you is to say that you cannot replicate its contribution to the fund
return with a strategy conceived on the basis of your knowledge.
Notice that for this reasoning to be valid it is not required that you actually perform
any analysis of the fund strategy before buying it. Perhaps we could agree on the fact
that you should perform such an analysis, before buying anything. A mystery of finance
is that people spend a lot of money in order to buy something whose properties are
unknown to the buyer. People wouldn’t behave in this way when buying, say, a car
or even a sandwich. However, any lack of analysis simply means that more of what happens,
unexpected by you, shall become (in your opinion) merit or fault of the fund manager.
It is important to understand that, according to this view, the evaluation of the
performance of a fund manager is, first of all, subjective. It is the addition of hypotheses
on the set of information used by subscribers, and on their willingness to optimize using
this information, that can convert the subjective evaluation into an economic model.
The problem here is, obviously, to define what we mean by “normal return” and
“known strategy”.
Here a market model, representing efficient financial use of public information, could
be the sensible solution. Were the market model and the effective asset manager’s asset
allocation available, the first could be used to define the efficiency of the second and,
by difference, possible unexpected (by the model) over or under performances on the
part of the fund manager.
Alas, for reasons that shall be discussed in the following sections, satisfactory empirical
versions of market models have still to appear or, at least, versions of market models,
and statistical estimates of the corresponding parameters, strong enough to be agreed upon
by everybody and so useful in an inter-subjective performance analysis.
A less ambitious and more empirically oriented alternative is return based style
analysis. This alternative yields a (model dependent) subjective statement about the
quality of the fund. We shall return to this point, but we stress the fact that, if the
purpose of the method is for a potential subscriber, or for someone already invested in
the fund, to judge the fund manager's performance, and not for some agency to award
prizes, the subjective component of the method is by no means a drawback.
Return based style analysis can be seen as a specific choice of "normal return" and
"known strategy" definitions. The "known strategy" is the investment in a set of tradable assets (typically total return indexes) according to a constant relative proportion
strategy; the "normal return" is the out of sample return of this strategy, previously
tuned in order to replicate the historical returns of the fund. This point has to be
hammered in, so we repeat: the strategy is not chosen in order to yield "optimal" returns (in
any case the lack of a market model would impede this) but only in order to replicate
as well as possible, in the least squares sense, the returns of the fund strategy.
In order to estimate the replica weights, the returns $R_t^\Pi$ of the fund under investigation are fitted to a constant relative proportion strategy with weights $\beta_j$ invested in
a set of $k$ predetermined indexes with returns $R_{jt}$:

$$R_t^\Pi = \sum_{j=1,\dots,k} \beta_j R_{jt} + \epsilon_t$$
The term “constant relative weights strategy” indicates, as usual, a strategy where
the proportion of wealth invested in any given index is kept constant over time. This
implies that, when some index over performs other indexes, a part of the investment
in the over performing index must be liquidated and invested in the under performing
indexes.
For the sake of comparison other possible strategies could be the buy and hold
strategy where a constant number of shares is kept for each index and the trend following strategy, where shares of “loser” indexes are sold to buy shares of “winner” indexes.
Both these strategies have variable weights on returns and could reasonably be used
as reference strategies.
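As a minimal numerical illustration of the difference between the two strategies (hypothetical two-index returns; nothing here comes from Sharpe's data), a steadily trending market favors buy and hold over the rebalanced constant mix:

```python
# Two hypothetical indexes; index 0 outperforms index 1 in every period.
returns = [(0.10, 0.00), (0.10, 0.00), (0.10, 0.00)]
target = (0.5, 0.5)           # constant relative proportions

cm_wealth = 1.0               # constant mix portfolio value
bh_holdings = [0.5, 0.5]      # buy and hold: initial money in each index

for r0, r1 in returns:
    # Constant mix: the period return is the target-weighted return, which
    # amounts to selling part of the winner at each rebalancing date.
    cm_wealth *= 1 + target[0] * r0 + target[1] * r1
    # Buy and hold: each position compounds on its own and weights drift.
    bh_holdings[0] *= 1 + r0
    bh_holdings[1] *= 1 + r1

print(round(cm_wealth, 4))         # 1.1576
print(round(sum(bh_holdings), 4))  # 1.1655: buy and hold wins in a trending market
```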
There exist variants of the constant relative proportions strategy itself. In a constrained version the weights could be required to be non negative (short positions are
not allowed). In another version weights could be allowed to change over time (in this
case we should assume that the sum of all weights is constant over time).
In typical implementations no intercept is in the model and the sum of betas is
constrained to be one. The constant is dropped because it is usually interpreted as a
constant return and, over more than one period, a constant return cannot be achieved
even from a risk free investment. The assumption that the sum of all weights is one is
an assumption required for the interpretation of the weights as relative exposures and,
in the case of a multi period strategy, in order for the portfolio to be self financing.
While both interpretations and both constraints could be challenged, in our applications we shall stick to the common use. We only mention the fact that, sometimes,
instead of imposing the "sum to one" constraint explicitly at estimation time [57], this
is implemented on an a posteriori basis by renormalizing the estimated coefficients. The two
methods do not yield the same results.
A relevant point in the choice of the reference strategy is that it should not cost
too much. In this sense the constant relative proportions strategy is open
to criticism, as it can imply non negligible transaction costs. The reason for its use in
style analysis seems to rest more on tradition than on suitability.
Notice that in no instance are we supposing that the fund under analysis actually
follows a constant relative proportion strategy invested in the provided set of indexes.
We are NOT trying to discover the true investment of the fund but only to replicate
its returns as well as we can with some simple model. This point has to be underlined
because, at least in the first paper on the topic, Sharpe himself seems to state that
the purpose of the analysis is to find the actual composition of the fund. This is
obviously impossible if it is not the case that the fund is invested, with a constant
relative proportions strategy, in the indexes used in the analysis.

[57] The $\sum_j \beta_j = 1$ constraint can be imposed on the OLS model in a very simple way. First choose any $R_{jt}$ series, say $R_{1t}$; typically the choice falls on some series representing returns from a short term bond, but any choice will do. Second, compute $\tilde{R}_t = R_t - R_{1t}$ and $\tilde{R}_{jt} = R_{jt} - R_{1t}$ for $j = 2, \dots, k$. Now regress $\tilde{R}_t$ on the $\tilde{R}_{jt}$ for $j = 2, \dots, k$. After running the regression, the coefficient for $R_{1t}$, which you do not directly estimate, shall be equal to $1 - \sum_{j=2}^{k} \beta_j$.
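The differencing trick of footnote 57 can be sketched numerically as follows (synthetic data; numpy's generic least squares routine stands in for the OLS fit, and all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
R_idx = rng.normal(0.0, 0.02, size=(n, k))         # index returns R_1 ... R_k
true_b = np.array([0.2, 0.5, 0.3])                 # weights summing to one
R_fund = R_idx @ true_b + rng.normal(0, 0.001, n)  # fund returns

# Differencing trick: subtract the first index from everything, then
# regress (without intercept) on the remaining k-1 differenced series.
y = R_fund - R_idx[:, 0]
X = R_idx[:, 1:] - R_idx[:, [0]]
b_rest, *_ = np.linalg.lstsq(X, y, rcond=None)
b1 = 1 - b_rest.sum()                              # implied first coefficient
b = np.concatenate([[b1], b_rest])

print(b.round(2))          # close to [0.2, 0.5, 0.3]
print(round(b.sum(), 10))  # 1.0 by construction
```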
In fact, the actual discovery of the composition of the fund and its evolution over
time would hardly add anything to the purpose of identifying the part of the fund’s
strategy not forecastable by the fund subscriber. A model would still be needed in
order to divide what is forecastable from what is unforecastable in the fund evolution.
Let us go back to the identity:

$$R_t^\Pi = \sum_{j=1,\dots,k} \beta_j R_{jt} + \epsilon_t$$
Up to now this is not an estimable model but, as said above, an identity. In order to
convert it into a model we must assume something about $\epsilon_t$. A way of doing this is to recall
the chapter on linear regression. The style model is clearly similar to a linear model,
in particular to a linear model where both the dependent and the independent
variables are stochastic. In this case we know that a minimal hypothesis for the OLS
estimate to work is that $E(\epsilon|R^I) = 0$, where $\epsilon$ is the vector containing the observations
on the $n$ $\epsilon_t$'s and $R^I$ is the matrix containing the $n$ observations on the returns of the
$k$ indexes. The second, less relevant, hypothesis is the usual $E(\epsilon \epsilon'|R^I) = \sigma_\epsilon^2 I_n$.
The hypothesis $E(\epsilon|R^I) = 0$ has a sensible financial meaning: we are supposing
that any error in our replication of the fund's returns is uncorrelated with the returns
of the indexes used in our replication.
Sharpe's suggestion for the use of the model in fund performance evaluation is as
follows: given a set of observations (typically with a weekly or lower frequency; Sharpe
uses monthly data) from time t = 1 to time t = n, fit the style model from t = 1 to
t = m < n and use the estimated coefficients for forecasting R_{m+1}; then add
observation m + 1 to the estimating set and (in most implementations) drop observation
1. Forecast R_{m+2} and so on. These forecasts represent the fund's performances as due
to its "style", where the term "style" indicates our replicating model. The important
point is that this "style" result is forecastable and, in principle, replicable by us. The
possible contribution of the fund manager, at least with respect to our replication
strategy, must be found in the forecast error. The quality of the fund manager has to
be evaluated only on the basis of this error.
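Sharpe's rolling procedure can be sketched as follows. The data are simulated with a small built-in "skill" term, and an unconstrained OLS fit is used for brevity (an actual implementation would impose the sum-to-one, and possibly non-negativity, constraints):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 120, 2, 60                    # 120 months, 2 indexes, 60-month window
R_idx = rng.normal(0.005, 0.04, (n, k))
skill = 0.005                           # manager's contribution per period
R_fund = R_idx @ np.array([0.6, 0.4]) + rng.normal(skill, 0.01, n)

errors = []
for t in range(m, n):
    X, y = R_idx[t - m:t], R_fund[t - m:t]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)  # fit on the trailing window
    style_forecast = R_idx[t] @ b              # the "style" part of month t
    errors.append(R_fund[t] - style_forecast)  # possible manager contribution

print(round(float(np.mean(errors)), 4))  # hovers around the built-in 0.005 skill
```

The average forecast error recovers, up to estimation noise, the "skill" term the simulation injected; on real data this is exactly the quantity on which the manager is judged.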
There are three possibilities:
• The fund manager return is similar (in some sense to be defined) to the replicating
portfolio return. In this case, since you are able to replicate the result of the fund
manager strategy using a “dumb” strategy, you shall be willing to pay the fund
manager only as much as the strategy costs.
• The fund manager returns are less than your replica returns. In this case you
should avoid the fund as it can be beaten even using a dumb strategy which is
not even conceived to be optimal but only to replicate the fund returns. This is a
strong negative result. While it is true that it is possible to find alternative assets
that, when calibrated to the fund returns in a style analysis, give a positive view
of the same manager's results, the fact that a simple strategy exists that beats the
fund returns is enough to call into question any fund manager's ability.
• The fund manager returns are better than your replica strategy. In this case it
seems that the manager adds to the fund strategy something which you cannot
replicate. This is a hint in favor of the fund manager's ability. It is a weak hint,
for the same reason the negative result is a strong hint: the negative result is
strong because a simple strategy beats the fund manager's one; the positive result
is weak because the fund manager beats a simple strategy, but others could exist
which equal or even beat the fund manager's strategy. In any case this is at
least a necessary condition for paying a fee greater than the simple strategy costs.
The important point to remember, here, is that the result is relative to the strategy
and the asset classes used. No attempt is made to build optimal portfolios with the
given asset classes; only replica portfolios are built. The reader should think about the
possible extensions of procedures like style analysis, were a market model available.
A simple example of style analysis using my version of Sharpe’s data and three well
known US funds is in the worksheet style analysis.xls.
10.1 Traditional approaches with some connection to style analysis
The idea that you should find some "normal" return with which to compare a fund
return, and that this definition of "normal" return is to be connected with the return of
some "simple strategy" related to the fund's strategy, is so basic that many empirical
attitudes are informally justified by it.
On a first level, we observe very rough fund classifications in "families" of funds,
defined by broad asset classes. This suggests that comparisons of funds be made only
inside the same family. In a sense the comparison strategy is implicitly considered as
the mean of the strategies in the same asset class.
Another shadow of this can be found in the frequently stressed idea that the result
of any fund management must be divided between asset allocation and stock picking.
In common language this partitioning is not well defined and asset allocation may
mean many different things as, for instance, the choice of the market, the choice of
some sector, the choice of some index. Moreover there is no precise definition of how
to distinguish between asset allocation and stock picking. But it is clear that this
distinction, again, hints at some normal return, derived by asset allocation, and some
residual: stock picking.
The “benchmarking” idea is another crude version of the same: you try to separate the fund manager’s ability from the overall market performance by devising a
benchmark which should summarize the market part of the fund manager strategy.
Market models can be seen as a step up the ladder. Here the benchmark idea is
expressed in a less naive way. Under the hypothesis that the market model holds and
is known and the beta (CAPM) or betas (APT) of the fund are known, the part of
the result due to the market factor(s) is to be ascribed to the overall fund strategic
positioning and, as such, its consequences are in principle a choice of the investor. Any
other over or under performances can be ascribed to the fund manager abilities and
private information.
As we mentioned above, this use of market model is greatly hampered by the fact
that the proposition “...the market model holds and is known and the beta (CAPM)
or betas (APT) of the fund are known” simply does not hold.
Now a few words on comparison criteria.
The classical Sharpe ratio considers the ratio of a portfolio return in excess of
a risk free rate to its standard deviation. Even in this form the Sharpe ratio is a
relative index: the fund performance is compared to a riskless investment. In general
this comparison is not a useful one. Typically our interest shall be to compare the
fund performance with a specific strategy which, in some instances, could be the best
possible replication of the fund's returns accomplished using information available to
the investor. In many cases this reference strategy shall be a passive strategy (this
does not mean that the strategy is a buy and hold strategy, but that the strategy can
be performed by a computer following a predefined program).
As considered before, such a strategy could be provided, for instance, by some asset
pricing model (CAPM, APT etc.). In other cases the reference strategy could simply
be represented by the choice of a benchmark, used either in the unsophisticated way
where, implicitly, a beta of one is supposed (that is, at the numerator of the Sharpe
ratio take the difference between the returns of the fund and those of the benchmark)
or in the more sophisticated way of computing the alpha of a regression between the
return of the fund and the return of the benchmark.
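Both uses of a benchmark can be sketched on simulated data (a hypothetical fund with beta 1.3 on the benchmark and a true alpha of 0.002; numpy only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120
bench = rng.normal(0.004, 0.04, n)                   # benchmark excess returns
fund = 0.002 + 1.3 * bench + rng.normal(0, 0.01, n)  # beta 1.3, alpha 0.002

# Unsophisticated comparison: an implicit beta of one
naive_excess = (fund - bench).mean()

# More sophisticated: the alpha of a regression of fund on benchmark
X = np.column_stack([np.ones(n), bench])
(alpha, beta), *_ = np.linalg.lstsq(X, fund, rcond=None)

print(round(naive_excess, 4))  # mixes the true alpha with the beta != 1 exposure
print(round(alpha, 4))         # close to the true 0.002
print(round(beta, 2))          # close to 1.3
```

With a beta different from one, the naive difference of returns confounds the manager's alpha with the extra factor exposure, which is exactly why the regression alpha is the more sophisticated choice.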
Otherwise the reference strategy could be based on an ad hoc analysis of the history
of the fund under investigation. Style analysis is a way to implement this analysis.
Two relevant final points.
First: the comparison strategy should always be a choice of the investor. It is rather
easy, from the fund's point of view, to choose as comparison a strategy or a benchmark
with respect to which the strategy of the fund is superior, at least in terms of alpha. This
is known as "Roll's critique". While the fact that the strategy chosen by the investor as
comparison is dominated by the fund strategy is admissible, since, usually, the fund does
not tune its strategy to this or that subscriber's comparison strategy (at least this is true
if the subscriber is not big!), when it is the fund that chooses the comparison strategy a
conflict of interests is almost certain.
Second: once the part of the strategy due to the fund manager's intervention is identified, a summary of it based on the Sharpe ratio or on Jensen's alpha is only one
of the possible choices and strongly depends on the subscriber's opinion on what is a
proper measure of risk and return.
10.2 Critiques of style analysis
Under the hypotheses and the interpretation described in the previous section, style
analysis can be considered a useful performance evaluation tool. However, at least in
the version suggested by Sharpe, it lends itself to some strong critiques.
A first very simple critique concerns the choice of the replicating strategy. While
the use of indexes does not create big problems, at least when these indexes can be
reproduced with some actual trading strategy, a big puzzle lies in the choice of a
constant relative proportion strategy. This is both an unlikely and a costly strategy,
due to portfolio rebalancing. The typical simple strategy is the buy and hold strategy,
most indexes are, in principle, buy and hold strategies and the market portfolio of
CAPM is a buy and hold strategy. As seen in chapter 1 the buy and hold strategy
is NOT a constant relative proportion strategy. Moreover, a buy and hold strategy
typically implies very small costs (the reinvestment of dividends is the main source of
costs if there is no inflow or outflow of capital from the fund) while a constant relative
proportion strategy implies frequent rebalancing of the portfolio.
Now, the replicating strategy is a free choice of the analyzer; however, if we simply
suppose that the fund follows a buy and hold strategy in the same indexes used by
the style analyzer, we end up with a strange, if perfectly natural, result. Obviously the R²
of the model shall not be 1, except in the case of identical returns for all the indexes
involved in the strategy; moreover, the analysis shall point out as "unforecastable", and
so due to the fund manager's action, any return of the fund due to the lack of rebalancing
implied in a buy and hold strategy.
Suppose, for instance, that some index during the analysis period should outperform
(or under perform) frequently the rest of the indexes used in the analysis. This shall
result in a forecast error for the strategy fitted using a constant relative proportion
strategy which shall attribute to the fund manager a positive contribution to the fund
result. On the contrary, temporary deviations of the return of one index from the
returns of the others shall result, in the comparison of the strategies, in favor of the
constant proportion strategy.[58]

[58] In the case of a positive trend of, say, an index with respect to the rest of the portfolio, a buy and hold strategy does not rebalance by selling some of the over performing index and buying the rest of the portfolio. In case of a further over performance of the index, the buy and hold portfolio shall over perform the rebalanced portfolio. In the case of a negative trend of some index with respect to the rest of the portfolio, the constant proportion strategy must buy some of the under performing index, selling some of the rest of the portfolio; if the under performance continues, this implies an over performance of the buy and hold strategy with respect to the constant relative proportion strategy. On the contrary, a strategy investing in temporary losers (after the loss!) or disinvesting in temporary winners shall outperform a buy and hold strategy in an oscillating market.
A second critique, of theoretical interest but hardly relevant in practice, is connected
with Roll's critiques of CAPM tests and, more in general, of CAPM based performance
evaluation. If the constant proportion strategy does not contain all the indexes required for composing an efficient portfolio, any investment by the fund manager in
the relevant excluded indexes shall result in an over performance. This would be relevant only if the evaluated fund manager knew, ex ante, the style model with which
his/her strategy shall be evaluated AND if the fund manager had more thorough
information on the structure of the efficient portfolio.
The point is that, while it is rather easy to compute an efficient portfolio ex post,
this is not so easy ex ante. Moreover, if we accept the idea that the style decomposition
depends on the information of the analyzer, this critique loses much of its force.
A third, and more subtle, critique can be raised against style analysis as well as against any
OLS based factor model used for performance evaluation. If the model is fitted to the
fund returns, the variance (or sum of squares, if no intercept is used) of the replicating
strategy shall always be less than or equal to that of the fund returns. In a CAPM
or APT logic this is not a problem, since only non diversifiable risk should be priced
in the market. However, as stressed above, we are NOT in a CAPM or APT world.
With this lack of variance we are giving a possible advantage to the fund. Ways for
correcting this problem can be suggested and, in fact, performance indexes which take
into account this problem do exist. However, since, as we saw above, the positive (for
the fund) result is already a weak result in style analysis, this undervaluation of the
variance is only another step in the same direction: negative valuations are strong,
neutral or positive valuations could be challenged.
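The sum of squares claim holds for any data set, because the no-intercept OLS fit is an orthogonal projection of the fund returns onto the space spanned by the index returns; a minimal check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 120, 3
R_idx = rng.normal(0.0, 0.03, (n, k))
R_fund = R_idx @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.02, n)

b, *_ = np.linalg.lstsq(R_idx, R_fund, rcond=None)  # no-intercept OLS fit
replica = R_idx @ b

# ||y||^2 = ||fitted||^2 + ||residual||^2, so the replica never has a larger
# sum of squares than the fund returns themselves.
print(bool(np.sum(replica**2) <= np.sum(R_fund**2)))  # True
```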
A last word of warning. Many data providers and financial consulting firms sell style
analysis. As far as I know, the advertising of commercial style models invariably asserts
the ability of such models to discover the true composition of the fund portfolio and
most reports produced by style analysis programs concentrate on the time evolution
(estimated by some rolling window OLS regression) of portfolio compositions. This is
quite misleading (Sharpe is somewhat responsible, as in the original papers he seems
to share this opinion) and can be accepted only if interpreted as an imprecise way to
state the true purpose of the analysis, that is: return replication. As far as I know,
the typical seller and user of style analysis, if not warned, tends to believe the "fund
composition" story. This false idea usually disappears after some debate, provided, at
least, that the user or seller is even marginally literate in simple quantitative methods.
Examples
Exercise 10-Style Analysis.xls
11 Factor models and principal components

11.1 A very short introduction to linear asset pricing models

11.1.1 What is a linear asset pricing model
Let us begin by considering the following plot. Here you find yearly excess total linear
return averages and standard deviations for those stocks which were in the S&P 100
index from 2000 to 2019 (weekly data, 83 stocks).
As you can see, stocks with similar average total returns show very different standard
deviations and vice versa. We know that the statistical error in the estimate of expected
returns using average returns may be big; however, if we believe that average return
has anything to do with expected return, and standard deviation with risk, the plot is
puzzling (and these are 83 BIG companies).
We can see asset pricing models as tools devised to answer the kind of puzzles which
plots like this one may raise.
Among these, two of the oldest and most relevant questions of Finance:
1. in the market we see securities whose prices evolve in completely different ways.
There may even be securities that have both mean returns lower and standard
deviations of returns higher than other securities. Why are all these securities,
with such apparently clashing statistical behaviours, still traded in equilibrium?
2. which are the “right” equilibrium relative prices of traded securities?
(Do not be puzzled by the fact that we speak of asset pricing models and write returns:
given the price at time 0, the return between time 0 and time 1 determines the price
at time 1.)
We anticipate here the answers to these two questions given by asset pricing models:
1. securities prices can be understood only when securities are considered within a
portfolio. Completely different (in terms, say, of means and variances of returns)
securities are traded because they contribute to improving the overall quality of
the portfolio (in the classic mean variance setting this boils down to the usual
diversification argument). What is relevant is not the total standard deviation of
each security but how much of it cannot be diversified away in a big portfolio; for
this reason the expected return of a security should not be compared with
its standard deviation but only with the part of this standard deviation which
cannot be diversified;
2. the right (excess) expected returns of different securities should be proportional
to the non diversifiable risks "contained" in the returns: to equal amounts of
the same risk should correspond equal amounts of expected return.
These are not the only observed properties of asset prices/returns that asset pricing models
try to account for. Another striking property is as follows: while thousands of securities
are quoted, there seems to be a very high correlation, on average, among their returns.
In a sense it is as if those many securities were “noisy” versions of much less numerous
“underlying” securities.
For instance, the 83 stocks of the S&P 100 displayed above show an average (simple)
correlation (over 20 years!) of .31. If we recall the discussion connected with the
spectral theorem and compute the eigenvalues of the covariance matrix of these returns,
while we see no eigenvalue equal to 0, the sum of the first 5 eigenvalues is greater than
50% of the sum of all eigenvalues, while the last, say, 50 eigenvalues account for about
15% of the total. The first eigenvalue alone is about 1/3 of the total. This suggests the
idea that, while not singular, the overall covariance matrix can be well approximated
by a singular covariance matrix.
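This kind of eigenvalue concentration is easy to reproduce on simulated data. Below is a minimal sketch (Python with NumPy; sizes and data are simulated and purely illustrative, not the S&P figures quoted above): a single common factor is enough to concentrate most of the total variance in the first eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 83                      # observations, securities (illustrative sizes)

# Simulate a one-factor market: each return loads on a common factor plus noise.
factor = rng.normal(0.0, 1.0, size=n)
betas = rng.uniform(0.5, 1.5, size=m)
returns = np.outer(factor, betas) + rng.normal(0.0, 1.0, size=(n, m))

# Eigenvalues of the sample covariance matrix, sorted in decreasing order.
eigvals = np.linalg.eigvalsh(np.cov(returns, rowvar=False))[::-1]

share_top1 = eigvals[0] / eigvals.sum()
share_top5 = eigvals[:5].sum() / eigvals.sum()
print(f"top eigenvalue share: {share_top1:.2f}, top 5 share: {share_top5:.2f}")
```

With a single simulated factor the first eigenvalue typically accounts for even more than the one third observed in the real data; the qualitative pattern, not the exact numbers, is the point.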
It should be clear that answering these questions and modelling the high average
correlation of returns is important for any asset manager and, in fact, asset
pricing models are central to any asset management style not purely based on gut
feelings.
We can deal with these problems within a simple class of asset pricing models known
as “linear (risk) factor models”. Here we give some hints of how this is done in practice.
An asset pricing model begins with a “market model”, that is, a model which describes asset returns (usually linear returns) as a function of “common factors” and
“idiosyncratic noise”. These models are, most frequently, linear models and a typical
market model for the 1 × m vector of excess returns r_t observed at time t, the 1 × k
vector f_t of “common risk factors” observed at time t and the 1 × m vector of errors ε_t
is:

r_t = α + f_t B + ε_t
where B is a k × m matrix of “factor weights” and α is a 1 × m vector of constants.
We suppose we observe the vectors r_t and f_t for n time periods t. Stacking the
n vectors of observations for r_t and f_t in the n × m matrix R and the n × k matrix
F, and stacking the corresponding error vectors in the n × m matrix ε, we suppose:
E(ε|F) = 0, V(ε_t|F) = Ω and E(ε_t′ε_{t′}|F) = 0 ∀t ≠ t′. In order to give meaning to the
term “idiosyncratic”, the contemporaneous covariance matrix Ω is, as a rule, supposed
to be diagonal, typically with unequal variances.
It is relevant to stress that such a time series model can be a good explanation of the data on R (for instance it may show a high R² for each return series) while,
at the same time, no asset pricing model is valid.
Let us recall that, if we estimate the market model with OLS (this may be done
security by security or even jointly), the OLS estimate of α can be written in a compact
way as
α̂ = r̄ − f̄B̂
where r̄ is the 1 × m vector of average excess returns (one for each security, averaged
over time), f̄ is the 1 × k vector of average common factor values (again: averaged
over time) and B̂ is the k × m matrix of OLS estimated factor weights (one for each
factor for each security).
The expected value of this, under the above hypotheses, is:
E(α̂) = E(r) − E(f )B
As we shall see in a moment, an asset pricing model is valid if, supposing Ω diagonal,
we have α = 0.
This is usually written as:
E(r) = λB
where λ = E(f) is a 1 × k vector of “prices of risk”; in a moment we shall see why
this name is used.
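As a concrete illustration, here is a minimal sketch (Python with NumPy; all data are simulated, and the sizes n, m, k are arbitrary) of the joint OLS estimation of the market model and of the compact formula α̂ = r̄ − f̄B̂.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 1000, 6, 2                 # periods, securities, factors (illustrative)

F = rng.normal(0.004, 0.02, size=(n, k))   # common factor excess returns
B = rng.normal(1.0, 0.3, size=(k, m))      # "true" factor weights
alpha = np.zeros(m)                        # the asset pricing restriction: alpha = 0
R = alpha + F @ B + rng.normal(0.0, 0.01, size=(n, m))

# OLS security by security, done jointly: regress the m columns of R on [1, F].
X = np.column_stack([np.ones(n), F])
coef, *_ = np.linalg.lstsq(X, R, rcond=None)
alpha_hat, B_hat = coef[0], coef[1:]

# Equivalent compact form: alpha_hat = rbar - fbar @ B_hat
rbar, fbar = R.mean(axis=0), F.mean(axis=0)
print(np.allclose(alpha_hat, rbar - fbar @ B_hat))
```

The identity α̂ = r̄ − f̄B̂ holds exactly because an OLS regression with an intercept passes through the sample means.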
It is now important to stress that this restriction may hold, so that the asset pricing
model is valid, even though the time series model offers a very poor
fit of r or, on the contrary, the fit may be very good and yet α ≠ 0.⁵⁹
For asset management purposes, however, a good fit of the time series
model with k ≪ m can be very useful even when the asset pricing model does not
hold.
⁵⁹ Beware: what we just described as a possible test of an asset pricing model is useful for understanding the loose interplay between the time series model and the asset pricing model but it is,
typically, not a very efficient way, from the statistical point of view, to test the validity of an asset
pricing model.
Suppose, for instance, you want to use a Markowitz model for your asset allocation.
In order to do this you need to estimate the variance covariance matrix of returns.
This requires the estimate of m(m + 1)/2 unknown parameters using n observations
on returns. With a moderately big m this could be a hopeless task.
Suppose now the market model works, at least in the time series sense: the
R² of each of the m linear models is big. In this case the variances of the
errors are small and:
V(r_t) = B′V(f_t)B + Ω ≈ B′V(f_t)B
Let us now count the parameters we need to estimate the varcov matrix of the
excess returns with and without the market model. Without the market model, the
estimation of V (rt ) would require the estimation of m(m + 1)/2 parameters, while with
the factor model it requires the estimation of k × m + k × (k + 1)/2 parameters, that is
B and V(f_t). Suppose for instance m = 500 and k = 10: the direct estimation of V(r_t)
implies the estimation of 125250 parameters, while the (approximate) estimate based
on the factor model needs “only” 5000 + 55 = 5055 parameters.
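The parameter count above can be checked with a few lines (a trivial sketch; the function names are ours):

```python
def n_params_direct(m: int) -> int:
    # Distinct elements of an m x m symmetric covariance matrix.
    return m * (m + 1) // 2

def n_params_factor(m: int, k: int) -> int:
    # Factor weights B (k x m) plus the k x k factor covariance V(f).
    return k * m + k * (k + 1) // 2

print(n_params_direct(500))       # 125250
print(n_params_factor(500, 10))   # 5055
```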
The reader should notice that, even if the above assumptions for V(r_t) are right,
the use of B′V(f_t)B in place of the full covariance matrix implies an underestimation of the variance of each asset return, which is going to be negligible only if all
the R² are big.
Let us move one step further. We must remember that our aim is the construction of
portfolios of securities with weights w and excess returns r_t w.
In this case we are not necessarily interested in the full V(r_t) but in the variance of the
portfolio
V(r_t w) = w′B′V(f_t)Bw + w′Ωw
It is well possible that w′Ωw be small, so that w′B′V(f_t)Bw be a good approximation
of V(r_t w), even if it is not true that all the R² are big and, by consequence, the diagonal
elements of Ω small.
Suppose that the weights w of the different securities in this portfolio are all of the
order of 1/m. This simply means that no single security dominates the portfolio.
We have, then

w′Ωw = Σ_{i=1}^m w_i² ω_i ≈ (1/m) Σ_{i=1}^m ω_i/m
and this, with bounded, but not necessarily small, diagonal elements ω_i of Ω, goes to
0 as m goes to infinity.
This means that, for large, well diversified, portfolios, “forgetting” Ω is irrelevant
even if its diagonal elements are not small. The hypothesis of a diagonal Ω, that is:
of idiosyncratic ε_t, is fundamental for this result.
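A quick numerical sketch of this diversification effect (Python with NumPy; the bounds on the ω_i are made up): the idiosyncratic contribution w′Ωw of an equally weighted portfolio shrinks roughly as 1/m.

```python
import numpy as np

rng = np.random.default_rng(2)

def idio_portfolio_variance(m: int) -> float:
    # Bounded, but not necessarily small, idiosyncratic variances omega_i.
    omega = rng.uniform(0.01, 0.09, size=m)
    w = np.full(m, 1.0 / m)               # equal weights, each of order 1/m
    return float(np.sum(w ** 2 * omega))  # w' Omega w with a diagonal Omega

for m in (10, 100, 1000):
    print(m, idio_portfolio_variance(m))
```

The printed values decrease roughly by a factor of 10 at each step, even though the individual ω_i never shrink.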
From this result, we can shed some light on the reason why we should have E(r) =
E(f )B = λB, that is: why an asset pricing model should hold.
In order to understand this, it is enough to compute the expected value and the
variance of our well diversified portfolio (notice the approximation sign for the variance):

E(r_t w) = E(r_t)w = αw + E(f_t)Bw

V(r_t w) ≈ w′B′V(f_t)Bw
Suppose now α ≠ 0; recall that B is a k × m matrix with (supposedly) k ≪ m and
we can always suppose that the rank of B is k (if this is not the case we can reduce
the number of factors).
This implies that the matrix B′V(f_t)B is an m × m matrix of rank k < m. The matrix
B′V(f_t)B is then positive SEMI-definite; this implies that there exist m − k orthogonal
vectors z such that z′z = 1 and z′B′V(f_t)Bz = 0.
According to what we discussed in the matrix algebra section and in the presentation
of the spectral theorem, under conditions we do not specify here, we can always build
from these a set of weights w$ such that w$′1 = 1 and αw$ > 0.
You should now understand the reason for the dollar sign. The vector w$
defines a zero-risk portfolio (zero variance) with positive excess return αw$ (since the
variance is zero, the expected excess return coincides with the realized excess return).
In other words, we created a risk free security (the portfolio) which yields a return
(arbitrarily) greater than the risk free rate. This is an “arbitrage” as one could borrow
any amount of money at the risk free rate and invest it in the portfolio with a positive
profit and no risk (hence the $). Provided all the financial operations involved (building
the portfolio, borrowing money etc.) are possible, this should not happen if traders are
“reasonable” (and if they know of the existence of the factor model).
The only way to unconditionally (that is: whatever the choice of w$) avoid this is
that α = 0, so that

E(r) = E(f)B = λB
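The arbitrage argument can be illustrated numerically. Since V(f) is positive definite, z′B′V(f)Bz = 0 exactly when Bz = 0, so zero-variance directions live in the null space of B. The sketch below (Python with NumPy; B, V(f) and α are all made up) builds such a direction with non-negative alpha exposure; a real construction would also need to handle the normalization w′1 = 1 with care, which we only hint at in a comment.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 8, 2
B = rng.normal(size=(k, m))                   # factor weights, rank k < m
Vf = np.array([[0.04, 0.01], [0.01, 0.02]])   # positive definite factor covariance
alpha = rng.normal(0.0, 0.01, size=m)         # suppose alpha != 0

# Orthonormal basis of the null space of B: the last m - k right singular vectors.
_, _, Vt = np.linalg.svd(B)
Z = Vt[k:].T

# Project alpha on the null space: Bz = 0 and alpha z = ||Z'alpha||^2 >= 0.
z = Z @ (Z.T @ alpha)
w = z / z.sum()   # rescale so weights sum to 1 (assumes z.sum() is not close to 0)

print("Bz ~ 0:", np.allclose(B @ z, 0))
print("portfolio variance:", z @ B.T @ Vf @ B @ z)
print("positive alpha exposure:", alpha @ z > 0)
```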
Let us now give a “financial interpretation” of this result.
Since each element β_{ji} of B represents the “amount” of non-diversifiable factor f_j
in the excess return of security i, and E(f_j) represents the excess expected return of
a security which has a “beta 1” with respect to the j-th factor and zero with respect to
the others (if the factor f_j is the excess return of a security, this could simply be the
excess return of that security, but this is not required), we may understand the name
“price of risk for factor j” used for E(f_j) = λ_j and the name “risk premium for factor j” given to
the “price times quantity” product λ_j β_{ji}.
Now that we have a rough idea of how an asset pricing model works, it could be
useful to go back to the questions with which this section began and think a little bit
about how the answers come from the asset pricing model.
We should first notice that the approximation
V(r_t w) = w′B′V(f_t)Bw + w′Ωw ≈ w′B′V(f_t)Bw
is a formal interpretation of the empirical fact that correlations among quoted
securities returns are on average high.
The interpretation is based on the idea that all returns depend (in different ways)
on the same underlying factors and what is “not factor” is uncorrelated across returns.
For this reason, well diversified portfolios of securities tend to show returns whose
variance only depends on that of factors.
As a consequence, it shall be difficult to build many such well diversified portfolios
which are not correlated among them.
In fact, if we assume the above approximation to be exact, and V(f_t) to be non-singular,
exactly k such portfolios can be built, the choice being unique modulo an
orthogonal transform. In this case it is quite tempting to interpret any choice of such
k uncorrelated portfolios as a “factor estimate”. Some aspects of this idea shall be
reconsidered when presenting the “principal component” approach to risk factor estimation.
Asset pricing models give a very precise answer to the puzzle about the fact that
securities are traded in the market even if they may show, at the same time, lower
average returns and higher standard deviations than other traded securities. This is
a possible equilibrium because what is relevant is not the “absolute risk” (marginal
standard deviation of returns) of a security, but its contribution to the risk/return mix
in a well diversified portfolio. For this reason, we can see a relatively low average return
and a high standard deviation simply because the security showing these statistics has
little correlation with the systematic risk factors. The model tells us many other interesting
things regarding this point. For instance: it tells us that if we see two securities
with, say, the same average returns and very different return standard deviations, the
correlation between the returns of these securities should be small.
Last: asset pricing models give us formulas for measuring the “right mix” of expected
returns and correlations with systematic risk factors (betas), and this answers the
question about the right equilibrium relative prices. On this basis, asset pricing models
give us a unified framework to precisely quantify and test the equilibrium price system
and to transform the statistical results into asset management tools.
11.1.2 Tests of the CAPM
The empirical analysis of asset pricing models is of central importance for asset management.
This is an introductory Econometrics course, however, and it is not the place for a detailed
analysis of how to test an asset pricing model.
It can, however, be useful to give a quick idea of how this could be done
in the case of the prototype of all asset pricing models: the CAPM.
The CAPM is a single common risk factor model where the risk factor is the excess
return of a “market portfolio”: rM .
According to the CAPM, the expected asset excess returns E(r_i) are proportional to
the asset betas, the proportionality constant being the expected market excess return:

E(r_1) = β_1 E(r_M)
E(r_2) = β_2 E(r_M)
...
E(r_m) = β_m E(r_M)
In the following we explain how linear regression can be used to test the CAPM. The kind
of test described here is quite simple and naive, similar to the first empirical analyses
of the CAPM at the end of the sixties. Much has been done in the
following fifty years but this is not a topic for this course.
We want to test whether

E(r_i) = β_i E(r_M),  i = 1, . . . , m.
In this equation, βi is the independent variable and E(ri ) is the dependent variable: in
fact CAPM asserts that E(ri ) is a linear function of βi .
Since E(r_i) and β_i are not observable, we must estimate them: E(r_i) is estimated
with the time average r̄_i, and β_i is estimated by OLS on the factor model as in the previous paragraph.
We consider the regression equation which asserts that r̄_i is a linear function of β̂_i
plus an error term (we need to insert an error term since we use estimates):

r̄_i = γ_0 + γ_1 β̂_i + ε_i,  i = 1, . . . , m
This is called the second-pass regression equation. It is a cross-sectional regression, unlike the time series regression of the factor model (in the factor model regression the
observations refer to different times, here the observations refer to different assets):

r̄_1 = γ_0 + γ_1 β̂_1 + ε_1
r̄_2 = γ_0 + γ_1 β̂_2 + ε_2
...
r̄_m = γ_0 + γ_1 β̂_m + ε_m
If the CAPM is valid, then γ_0 and γ_1 should satisfy

γ_0 = 0 and γ_1 = r̄_M,

where r̄_M is the mean market excess return.
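The two-pass procedure can be sketched as follows (Python with NumPy; all data are simulated under the CAPM, so the restrictions hold by construction).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2000, 30                      # periods, assets (illustrative sizes)

# Simulate CAPM-consistent data: E(r_i) = beta_i E(r_M), no alphas.
rM = rng.normal(0.005, 0.04, size=n)             # market excess return
beta = rng.uniform(0.5, 1.5, size=m)
R = np.outer(rM, beta) + rng.normal(0.0, 0.02, size=(n, m))

# First pass (time series): estimate each beta_i by OLS on the market excess return.
X1 = np.column_stack([np.ones(n), rM])
beta_hat = np.linalg.lstsq(X1, R, rcond=None)[0][1]

# Second pass (cross section): regress average excess returns on estimated betas.
X2 = np.column_stack([np.ones(m), beta_hat])
gamma0, gamma1 = np.linalg.lstsq(X2, R.mean(axis=0), rcond=None)[0]

print(f"gamma0 = {gamma0:.4f} (CAPM: 0), "
      f"gamma1 = {gamma1:.4f} (CAPM: rM mean = {rM.mean():.4f})")
```

On CAPM-consistent data γ̂_0 comes out near 0 and γ̂_1 near r̄_M; on real data the sampling error discussed below makes the test much less clean.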
In fact, however, you can go a step further and argue that the key property of the
expected return-beta relationship of CAPM asserts that the expected excess return
is determined only by beta. In particular, if the CAPM is valid, the expected excess
return should be independent of non-systematic risk, as measured by the variances
of the residuals σ_i², which are also estimated in the factor model. Furthermore the
dependence on beta should be linear. To verify both conclusions of the CAPM, you
can consider the augmented regression model
r̄_1 = γ_0 + γ_1 β̂_1 + γ_2 β̂_1² + γ_3 σ_1² + ε_1
r̄_2 = γ_0 + γ_1 β̂_2 + γ_2 β̂_2² + γ_3 σ_2² + ε_2
...
r̄_m = γ_0 + γ_1 β̂_m + γ_2 β̂_m² + γ_3 σ_m² + ε_m
and test
γ_0 = 0, γ_1 = r̄_M, γ_2 = 0, γ_3 = 0.
There are several difficulties with the above procedure. First and foremost, stock returns are extremely volatile, which reduces the precision of any test. In light of this
volatility, the security betas and expected returns are estimated with substantial sampling error. A possible improvement is to group returns in portfolios
instead of using them one by one. A classic procedure based on this idea begins with
ordering the average returns by estimated beta value. The m average returns are then
grouped in, say, q portfolios and, for each portfolio, the average beta of its components
is computed.
The second step regression is then run using as dependent variable the average
return of each portfolio and as regressor its average beta. Averaging should
decrease the sampling error implied by the market model regressions.
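The grouping step can be sketched as follows (Python with NumPy; the betas and average returns are simulated and the group number q is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
m, q = 30, 5                          # assets, portfolios (illustrative)
beta_hat = rng.uniform(0.5, 1.5, size=m)                  # first-pass estimated betas
rbar = 0.005 * beta_hat + rng.normal(0.0, 0.001, size=m)  # average excess returns

# Order the assets by estimated beta and split them into q equally sized groups.
order = np.argsort(beta_hat)
groups = np.array_split(order, q)

# For each portfolio: average return and average beta of its components.
port_rbar = np.array([rbar[g].mean() for g in groups])
port_beta = np.array([beta_hat[g].mean() for g in groups])
print(port_beta, port_rbar)
```

The second-pass regression is then run on the q pairs (port_beta, port_rbar) instead of the m original ones.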
As written above, this course is not the place for a more detailed study of how to
test asset pricing models. There is neither space nor opportunity to discuss the empirical
successes and failures of asset pricing models (the theoretical and empirical literature
is huge and the debate has raged on since the invention of the CAPM almost 60 years ago).
Just a quick glimpse at the S&P 100 data used above. Regress each excess return
on the excess return of the market index and take the residuals of each of the 83
regressions. The question is whether the residuals of these regressions are “idiosyncratic”. While
the original excess returns are almost invariably positively correlated, the residuals show
both positive and negative correlations. A reasonable measure of “overall correlation”, then, is the
sum of the squared elements of the correlation matrix. If we take the ratio between this sum
of squares for the residuals and for the original data we get
.23. The simple beta model, then, reduces the measure of overall correlation to less
than one quarter of its original value. We deduce that, while other factors may be
necessary, the single beta model takes us a long way toward the separation between
systematic and idiosyncratic “risk”.
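The overall-correlation measure is easy to compute; here is a sketch on simulated one-factor data (Python with NumPy; the ratio obtained here is illustrative, not the .23 of the S&P data):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 2000, 40                      # periods, assets (illustrative)

# One-factor data: excess returns are positively correlated through the market.
rM = rng.normal(0.0, 0.04, size=n)
R = np.outer(rM, rng.uniform(0.5, 1.5, size=m)) + rng.normal(0.0, 0.03, size=(n, m))

# Residuals of each return regressed on the market excess return.
X = np.column_stack([np.ones(n), rM])
resid = R - X @ np.linalg.lstsq(X, R, rcond=None)[0]

# Overall correlation: sum of the squared elements of the correlation matrix.
def overall_corr(Y):
    return float((np.corrcoef(Y, rowvar=False) ** 2).sum())

ratio = overall_corr(resid) / overall_corr(R)
print(f"ratio of overall correlation, residuals vs. returns: {ratio:.2f}")
```

On data generated by a single factor the ratio drops well below one, in the spirit of the empirical result quoted in the text.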
Asset pricing models are central both for asset management and for Corporate
Finance, for this reason they constitute a mainstay of Finance education.
For the interested Reader a good starting point would be: Kenneth J. Singleton,
“Empirical Dynamic Asset Pricing: Model Specification and Econometric Assessment”,
2006, Princeton University Press.
11.2 Estimates for B and F
When the factors F are observable variables, the matrix B can be estimated using OLS
(in fact a slightly better estimate exists but it is outside the scope of these notes).
This, in principle, is what we did with the style model, which could, with some
indulgence, be considered as the “market model” part of an asset pricing model.
In fact, for the style analysis method to work it is not strictly necessary that the
style model corresponds to a full market model. This is due to the fact that, in style
analysis, the model is used as a reference benchmark only.
The joint use of a benchmark model which is also a market model would, in any
case, be in theory a more coherent choice.
We also discussed this in the case of the CAPM. In the CAPM there exists a
single common factor, represented by the wealth of agents (intended as everything that
impacts the agents’ utility) as risked on the market, plus m idiosyncratic factors
supposed to be uncorrelated with the common factor and among themselves. The
common factor cannot be directly observed and is proxied in practice by some market
index. If we believe in the quality of the proxy for the wealth of agents,
an OLS estimate shall work also in this case.
The typical asset pricing model uses as factors some CAPM-like index and observable macroeconomic variables. The Fama and French model is a CAPM plus two
long/short portfolios for value stocks (low against high price-to-book value) and size
stocks (later a momentum portfolio was added).
A huge academic industry in “finding relevant risk factors” to “explain the cross section of stock returns” (recall the second stage regression) arose from this, with hundreds
of papers and suggested risk factors.⁶⁰
Current practitioner models, widely used in the asset management industry for
asset allocation, risk management and budgeting, and performance evaluation, include,
in my experience, roughly from 10 to 15 risk factors and are tuned to specific
asset classes, so that they do not pretend to be general market models.

⁶⁰ For those interested, read: Campbell R. Harvey, Yan Liu, Heqing Zhu, “. . . and the Cross-Section
of Expected Returns”, The Review of Financial Studies, Volume 29, Issue 1, January 2016, Pages 5–68.
In this very interesting and funny paper the Authors attempt a wide review of the risk factors suggested
for market models in papers published up to 2015. They consider 313 papers and 316 different, but
often correlated, factors. The Authors are very clear about the fact that this is, actually, not a complete
review of the published and unpublished research on the topic. The Authors summarize the results
and stress the important statistical implications of the fact that, since the vast majority use data
on the US stock market or on markets correlated with it, these papers are not based on independent
experiments or observational data, but on what are, in essence, different time sections of the same
dataset. This is a classic case of the “data mining”, “multiple testing” or “exhausted data” problem,
sometimes also called “pretesting bias”. In this case many, in general dependent, tests are run on the
same dataset. Often, tests are chosen and run conditional on the results of other tests. This requires a
very careful assessment of the joint P-value of the testing procedure, which cannot be reduced to a test-by-test
analysis. The result of such an assessment is that individual tests should be run under increasingly
stringent “significance” requirements as new hypotheses are tested in addition to old ones. This
quickly makes it impossible to test new hypotheses on the same “exhausted” dataset.
All these models can, in principle, be dealt with by regression methods.
There is, however, a different attitude toward factor modeling.
This attitude attempts a representation of underlying unobserved factors based on
portfolios of securities which are not defined a priori but jointly estimated with the
model by optimizing some “best fit” criterion.
In order to do this, we need a joint estimate of F, the matrix of observations of
all factors, and B, the factor weights matrix.
A common starting point is to require the factors f_t to be linear combinations
of excess returns: f_t = r_t L.
In principle there exist infinitely many choices for L. A unique solution can be chosen only
by imposing further constraints. Each choice of constraints identifies a different set of
factors.
Most frequently, factor models of this kind are based on the principal components
method or on variants of it.
The principal components method is a classic data reduction method of Multivariate Statistics which has received a lot of new interest with the growth of “big data”.
In Finance, principal components have been in use at least since the nineteen sixties/seventies.
We can describe the procedure of “factor extraction”, that is, the unique identification/estimation of factors, in two different but equivalent ways.
Both methods require, implicitly or explicitly, an a priori, maybe very rough, estimate of V(r_t). For this to be possible, a fundamental assumption is that V(r_t) = V(r),
that is: the variance covariance matrix of excess total returns is time independent.
When this is not assumed to hold, more complex methods than simple principal
components are available, but these are well beyond the scope of these notes.
11.2.1 Principal components as factors
As a starting point, suppose that the variance covariance matrix of a 1 × m vector of
returns r, V(r), is known.
We introduce the principal components, at first, in an arbitrary way. In the following subsection we shall justify the choice.
From the spectral theorem we know that V(r) = XΛX′. By the rules of matrix
product, and recalling that Λ is diagonal, we have:

XΛX′ = Σ_i x_i x_i′ λ_i

where x_i is the i-th column of X and the sum runs from 1 to m.
Notice that, in general, only k of the eigenvalues λ_j are greater than 0 while the others
are equal to 0. Here k is the rank of V(r). For simplicity, in the following formulas we
suppose k = m, but with proper changes of indexes the formulas are correct in general.
Define the “principal components” as the “factors” (and remember: principal components are linear combinations of returns) f_j = r x_j and regress r on f_j.⁶¹ These are
m univariate regressions and the “betas” (one for each return in r) of these regressions
are, as usual:⁶²

β_j = Cov(f_j; r)/V(f_j) = [E(x_j′r′r) − E(x_j′r′)E(r)]/V(f_j) = x_j′V(r)/V(f_j)
However:

x_j′V(r) = x_j′XΛX′ = x_j′ Σ_i x_i x_i′ λ_i = λ_j x_j′

and

V(f_j) = V(r x_j) = x_j′XΛX′x_j = λ_j

so that:

β_j = x_j′
Let us now find V(r − f_j β_j):

V(r − f_j β_j) = V(r − r x_j x_j′) = [I − x_j x_j′]V(r)[I − x_j x_j′] =
= [I − x_j x_j′]XΛX′[I − x_j x_j′] = [XΛX′ − λ_j x_j x_j′][I − x_j x_j′] =
= XΛX′ − λ_j x_j x_j′ − λ_j x_j x_j′ + λ_j x_j x_j′ =
= XΛX′ − λ_j x_j x_j′ = Σ_{i≠j} x_i x_i′ λ_i = X_{−j} Λ_{−j} X_{−j}′
where X_{−j} and Λ_{−j} are, respectively, the X matrix with column j dropped and the Λ
matrix with row and column j dropped.
In other words, the covariance matrix of the “residuals” r − f_j β_j has the same
eigenvectors and eigenvalues as the original covariance matrix, with the exception of
the eigenvector and eigenvalue involved in the computation of f_j.⁶³
⁶¹ Here the regression is to be intended as the best approximation of r_i by means of a linear transformation of f_j. The intercept is included, see the next note.
⁶² Notice that the definition of β_j employed here implies the use of an intercept. We have not
mentioned it since we are interested in the variance-covariance matrix of r, which is unaffected
by the constant. In any case, the value of the constant 1 × m vector α is E(r) − E(f)β = 0.
⁶³ A fully matrix notation makes the derivations even simpler, if less understandable.
This result is due to the orthogonality of the factors⁶⁴ and has several interesting implications. We mention just three of them.
First: one by one “factor extraction”, that is, the computation of the f’s and the corresponding residuals, yields the same results whether performed in batch or one by one.
Second: the result is invariant to the order of computation.
Third: once all factors are considered, the residual variance is 0.
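These properties can be verified numerically. The sketch below (Python with NumPy; returns simulated and demeaned) extracts the components from a sample covariance matrix and checks the identities derived above: V(f_j) = λ_j with zero cross-correlations, β_j = x_j′, and the residual covariance V(r) − λ_j x_j x_j′.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 3000, 5
R = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))   # correlated "returns"
R = R - R.mean(axis=0)                                  # work with demeaned returns

V = np.cov(R, rowvar=False)
lam, X = np.linalg.eigh(V)                 # V = X Lambda X'
lam, X = lam[::-1], X[:, ::-1]             # sort eigenvalues in decreasing order

F = R @ X                                  # principal components f_j = r x_j

# The components are uncorrelated and V(f_j) = lambda_j ...
assert np.allclose(np.cov(F, rowvar=False), np.diag(lam))
# ... the regression coefficient of r on f_j is beta_j = x_j' ...
j = 0
beta_j = (F[:, j] @ R) / (F[:, j] @ F[:, j])
assert np.allclose(beta_j, X[:, j])
# ... and extracting f_j removes exactly one eigenvalue/eigenvector pair.
resid = R - np.outer(F[:, j], beta_j)
assert np.allclose(np.cov(resid, rowvar=False),
                   V - lam[j] * np.outer(X[:, j], X[:, j]))
print("all principal-component identities verified")
```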
This last obvious result can be written in the following way. If we set F = rX we have
r = FX′. Grouping in F_q and X_q the first q factors and columns of X, and in F_{m−q}
and X_{m−q} the rest of the factors and columns of X, we have:

r = Σ_{i=1}^m f_i x_i′ = Σ_{i=1}^q f_i x_i′ + Σ_{i=q+1}^m f_i x_i′ = F_q X_q′ + F_{m−q} X_{m−q}′
which we are tempted to write as:

r = F_q X_q′ + e
Now recall the initial factor model (we drop the t suffix for the moment):

r = fB + ε

It is tempting to equate F_q to f and X_q′ to B for some q. At the same time it is
tempting to equate e with ε.⁶⁵ Now, given the above construction, it is always possible
to build such a representation of r. The question is whether, given a pre-specified
model r = fB + ε, the above described method shall identify f, B and ε. The answer
is: “in general, not”.
In fact the two formulas are only apparently similar and become identical only
under some hypotheses. These are:
⁶³ (continued) If we suppose V(r) invertible with eigenvector matrix X, we have rX = F and immediately r = FX′,
so principal components are linear combinations of returns and vice versa. Moreover V(F) = V(rX) =
X′V(r)X = X′XΛX′X = Λ, that is: principal components are uncorrelated and each has as variance
the corresponding eigenvalue. Then, if we split X vertically in two sub-matrices X = [X₁ : X₂], we
have rX = F = [rX₁ : rX₂] = [F₁ : F₂] and r = FX′ = F₁X₁′ + F₂X₂′, where
V(F₂X₂′) = X₂Λ₂X₂′. Since principal components are uncorrelated, this implies that, whatever the
number of components in F₁, their regression coefficients shall always be the same and correspond to
the transposes of their eigenvectors (the first statement is a direct consequence of non-correlation and
the second was demonstrated in the text). In matrix terms: the “linear model” estimated with OLS,
r = F₁B̂₁ + Û₁, holds with B̂₁ = X₁′ and Û₁ = F₂X₂′.
⁶⁴ Orthogonality here means that the factors are uncorrelated.
⁶⁵ It could be argued here that the expectation of e is not zero. Recall, on the other hand, that
expected returns are typically nearer zero than most observed returns, due to high volatility. This is
particularly true when daily data are considered. Moreover, the non-zero mean effect is damped down
by the “small” matrix X_{m−q}. Hence the expected value of e can be considered negligible.
1. The dimension of f is q.
2. V(f) is diagonal.
3. BB′ = I.
4. The rank of V(ε) is m − q and the maximum eigenvalue of V(ε) is smaller than
the minimum element on the diagonal of V(f).
To these hypotheses we must add the already mentioned requirement that f and ε are
orthogonal.
For any given fB the second and third hypotheses can always be satisfied if V(fB)
is of full rank. In fact, in this case, it is always possible, using the procedure described
above, to write fB = f̃B̃ where the required conditions are true for f̃ and B̃ (remember
that, if the f are unobservable, there is a degree of arbitrariness in the representation).
Hypothesis one is more problematic: all we observe is r and we do not know, a
priori, the value of q.
But the most relevant (and interesting) hypothesis is that the rank of V(ε) is m − q
and its eigenvalues are all smaller than those of V(fB).
This may well not be the case and, in fact, we could consider examples where ε is a
vector orthogonal to the elements of f but V(ε) is of full rank and/or its eigenvalues
are not all smaller than those of V(fB).
For instance: in classical asset pricing models (CAPM, APT and the like) the main
difference between residuals and factors is not that the variance contributed by the
factors to the returns is bigger than the variance contributed by the “residuals”, but that
factors are common to different securities, so that they generate correlation between returns,
while residuals are idiosyncratic, that is: they should be uncorrelated across securities.
While principal component analysis guarantees zero correlation across different factors,
residuals in the principal component method are by no means constrained to be uncorrelated across different securities. In fact, since the varcov matrix of residuals is not of
full rank, some correlation between residuals must exist and shall in general be higher
if many factors are used in the analysis.⁶⁶
While this is not the place for a detailed analysis of this important point, it is
useful to introduce it as a way of remembering that r = F_q X_q′ + e is, first of all, a
representation of r and only under (typically non-testable) hypotheses an estimate
of a factor model.

⁶⁶ Suppose the row vector z of k random variables has a varcov matrix A such that Rank(A) = h. Then
at most h linear combinations of the elements of z can be uncorrelated. The proof is easy. Suppose
a generic number g of uncorrelated linear combinations of z exists, let these g linear combinations
equal u = zG and suppose, without loss of generality, that by a proper choice of the G weights the
variance of each u is 1. Since the u are uncorrelated we have V(u) = G′AG = I_g. Since the rank of
I_g is g, and the rank of a product is less than or equal to the minimum rank of the involved matrices,
the rank of A must by necessity be bigger than or equal to g; but, by hypothesis, we know it to be
equal to h, so g cannot be bigger than h (we could go on and show that it is in fact equal to h, but
we only wanted to show that AT MOST h linear combinations can be uncorrelated).
In our setting we need the representation in order to simplify the estimation of
V(r); while the interpretation of the result as the estimate of a factor model is very
useful when possible, the simple representation shall be enough for our purposes.
It should always be remembered that our purpose is not the precise estimation of
each element of V(r). What we really hope for is a sensible estimate of the variance
of reasonably diversified portfolios made with the returns in r. In this case, even if
the estimate of V(r) is rough, it may well be that the estimate of a well diversified
portfolio variance is fairly precise since, by itself, diversification shall erase most of the
idiosyncratic components in the variance covariance matrix.
This intuitive reasoning can be made precise but this is beyond the purpose of our
introductory course.
A last point of warning is required. If we use enough principal components, then
F_q X_q′ behaves almost as r (the R² of the regression is big). The “almost” clause is
important. Suppose you invest in a portfolio with weights x_{q+1}/(x_{q+1}′1_m), that is, a
portfolio with correlation 1 with the first excluded component (the denominator of the
weights is there in order to have the portfolio weights sum to 1). By construction the
variance of this portfolio is λ_{q+1}/(x_{q+1}′1_m)². However, the covariance of this portfolio
with the included components is zero. In other words: if we measure the risk of any
portfolio by computing its covariance with the set of q principal components included
in the approximation of V(r), we shall assign zero risk to a portfolio correlated with one
(or many) excluded components.
The practical implications of this are quite relevant but, a thorough discussion is
outside the purpose of these handouts. However: beware!
The question now is: we introduced the factors/components F in a somewhat arbitrary
way, deriving them from the spectral theorem. Are there other justifications for them?
11.3 Maximum variance factors
In the preceding section we derived a principal component representation of a return
vector by comparing the spectral theorem with the general assumptions of a linear
factor model.
Here we follow a different path: we characterize each principal component (suitably
renormalized) as a particular “maximum risk” portfolio with the constraint that each
component must be orthogonal to each other component and that the sum of squared
weights should be equal to one.
Linear combinations of returns are (up to a multiplicative constant) returns of
(constant relative weights) portfolios67. Given a set of returns it is interesting to answer
67 As hinted at in several places in these handouts, given a linear combination of returns, there exist
at least two ways of converting this into the return of a portfolio. If we only want the required portfolio
the question: which are the weights of the maximum variance linear combination of
returns? (We repeat: this is not the same as the maximum variance portfolio.)
This problem is not well defined since the variance of any portfolio (provided it is not
0) can be set to any value by multiplying its weights by a constant.
It could be suggested to constrain the sum of weights to one; however, this does
not solve the problem. Again, by considering multiples of the different positions the
requirement can be satisfied and the variance set to any number, at least if weights
are allowed to be both positive and negative.
A possible solution is to set the sum of the absolute values of the weights to one. This
would both solve the problem and have a financial meaning. Alas, this can be done,
but only numerically.
Suppose instead we set the sum of squared weights to 1. This solves the bounding
problem with the inconvenience that the resulting linear combination shall in general
not be a portfolio. But this choice yields an analytic solution.
Let us set the mathematical problem:

max_{θ:θ'θ=1} V(rθ) = max_{θ:θ'θ=1} θ'V(r)θ
The Lagrangian for this problem is:
L = θ'V(r)θ − λ[θ'θ − 1]
So that the first order conditions are:
V (r)θ − λθ = 0
and
θ'θ = 1
Rearranging and using the spectral theorem we have:
[XΛX' − λI]θ = 0
We see that, if we set θ = x_j and λ to the corresponding λ_j, for any j we have a solution
of the first order conditions. Since V(rx_j) = λ_j, the solution to the maximum variance problem is
given by the pair x_1 and λ_1 where, as usual, we suppose the eigenvalues sorted by size.
to be perfectly correlated with the given linear combination, all that is needed is to renormalize the
weights by dividing them by their sum (provided this is not zero). If we wish for a portfolio with
the same weights (on risky assets) and the same variance as those of the linear combination, we must
simply add to the linear combination the return of a risk free security with weight equal to the difference
between one and the sum of the linear combination's weights. Notice that in this second case, while
the (one time period) variance of the linear combination shall be the same as the variance of the
portfolio return (the risk free security has no variance for a single time period), the expected value
shall be different. In fact, if the weight of the risk free security is greater than zero, the expected value
of the portfolio return shall be (with a positive return assumed for the risk free security) greater than
the expected value of the linear combination of returns, and the opposite in case of a negative weight.
From what was discussed in the previous section, the other solutions can be seen as the
maximum variance linear combinations of returns, where the maximum is taken with
the added constraint of being orthogonal to the previously computed linear combinations.
We see that the components defined in a somewhat arbitrary way in the previous
section now become orthogonal (conditional) maximum variance linear combinations.
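This characterization can be checked numerically. The following sketch (Python with NumPy; the covariance matrix is built from made-up data) verifies that the first eigenvector attains the largest eigenvalue as its variance and that no other unit-norm weight vector does better:

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive definite matrix playing the role of V(r) (hypothetical data).
A = rng.standard_normal((200, 4))
V = np.cov(A, rowvar=False)

lam, X = np.linalg.eigh(V)               # spectral decomposition of V
order = np.argsort(lam)[::-1]            # eigenvalues sorted by decreasing size
lam, X = lam[order], X[:, order]

x1 = X[:, 0]                             # weights of the maximum variance combination
# x1 attains the bound: x1' V x1 equals the largest eigenvalue lambda_1.
print(np.isclose(x1 @ V @ x1, lam[0]))   # True

# No other unit-norm vector theta does better: theta' V theta <= lambda_1.
for _ in range(1000):
    theta = rng.standard_normal(4)
    theta /= np.linalg.norm(theta)       # impose the constraint theta' theta = 1
    assert theta @ V @ theta <= lam[0] + 1e-10
```

The loop is only a spot check, of course; the bound θ'V(r)θ ≤ λ₁ for θ'θ = 1 holds exactly by the argument above.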
11.4 Bad covariance and good components?
Suppose now that V (r) is not known. In particular our problem is to estimate such a
matrix when m, the number of stocks, is big (say 500-2000). What we wrote up to this
point suggests a way for simplifying a given variance covariance matrix using principal
components. What happens when the variance covariance matrix is not given and we
must estimate it?
Obviously we could start with some standard estimate of V(r). For instance,
suppose we stack in the n × m matrix R our data on returns and estimate V̂(r) =
R'R/n − R'1_n1_n'R/n² where 1_n is a column vector of n ones. Then we could proceed
by extracting the principal components from V̂(r).
The reader may be puzzled by the fact that, in order to estimate the factor
model, whose purpose is to make possible a sensible estimate of the covariance matrix,
we need some a priori estimate of the same matrix. A complete answer to this question
is outside the scope of these notes (this sentence appears an annoying number of times.
Doesn't it?); however, the intuition underlying a possible explanation is connected with
the fact that, in principle, the principal components could be computed without an
explicit a priori estimate of V(r). Given a sample R of n observations on r_t,
all that is needed is, for instance in the case of the first component, to find the vector
x_1 of weights such that the numerical variance of f_1 = Rx_1 is maximum (with the
usual constraint x_1'x_1 = 1). This can be done iteratively for all components. The idea
is that, even if the full V(r) is difficult to estimate, it may be possible to estimate the
highest variance components while the estimation problems are concentrated on the
lowest variance components.
More formally: we estimate V(r) with some V̂(r) = V(r) + E where E is a positive
definite error matrix. Write the spectral decomposition for both matrices as: V(r) =
Σ_j x_j x_j' λ_j and E = Σ_j e_j e_j' η_j. Our hope is that the highest of the error
eigenvalues η_j is smaller than at least some of the V(r) eigenvalues. In this case the
estimation error shall affect the overall quality of the estimate V̂(r) but only with
respect to the lowest eigenvalue components.
In summary. The principal components are defined as f_t = r_t X where X are the
eigenvectors of the return covariance matrix. The principal components are uncorrelated return portfolios (recall that a constant coefficients linear combination of returns
is the return of a constant relative proportion strategy; moreover recall that the sum of
weights in the principal component portfolios is not one). The variances of the principal
components are the eigenvalues corresponding to the eigenvectors which constitute
the portfolio weights. We can derive a solution to the problem r_t = f_t B by simply
setting B = X'. The percentage of variance of the j-th return due to each principal
component can be computed by taking the squares of the elements of the j-th column
of X', multiplying each by the corresponding eigenvalue, and dividing each element of
the resulting vector by the total sum of the vector itself.
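The summary above can be sketched numerically. The following Python fragment (simulated returns with made-up volatilities) extracts the components from a standard covariance estimate and checks that they are uncorrelated, with variances equal to the eigenvalues, and that their weights do not sum to one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated n x m return matrix (hypothetical data: m = 4 assets).
R = rng.standard_normal((500, 4)) @ np.diag([0.05, 0.03, 0.02, 0.01])
Vhat = np.cov(R, rowvar=False)           # standard covariance estimate

lam, X = np.linalg.eigh(Vhat)
order = np.argsort(lam)[::-1]            # eigenvalues sorted by decreasing size
lam, X = lam[order], X[:, order]

F = R @ X                                # principal components f_t = r_t X

# Their sample covariance is diagonal, with the eigenvalues on the diagonal...
print(np.allclose(np.cov(F, rowvar=False), np.diag(lam)))  # True
# ...and the component portfolio weights do not, in general, sum to one.
print(X.sum(axis=0))
```

Since F = RX, the sample covariance of F is X'V̂(r)X = Λ exactly, which is what the first check confirms.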
A simple PC analysis on a set of 6 stock return series can be found in the file
"principal components.xls". A more interesting dataset containing total return indexes
for 49 of the 50 components in the Eurostoxx50 index (weekly data) can be found
in the file "eurostoxx50.xls". Principal components were computed using the add-in
MATRIX.
Examples
Exercise 11 - Principal Components.xls Exercise 11b - PC, Eurostoxx50.xls
12 Black and Litterman
The direct estimation, based on historical data, of the expectation vector for a set of
stock returns seems to be doomed to failure in any conceivable real world circumstance.
As considered in a previous section, this is mainly due to the fact that the typical
estimate has a standard error, in yearly terms, which is in the range of 25-35% divided
by the square root of the number of yearly data available in the dataset (notice that the
use of daily, monthly or yearly data makes no difference, at least for log returns). For
an estimate to show a 95% confidence interval of reasonable size (say ±5%) we should
then use data for a number of years in the range of 100-120 and, for many reasons, this
is quite an unlikely sample size. Typical sample sizes for reliable and homogeneous
data on single stocks are 5-10 years and these imply 95% confidence interval sizes in the
range of ±20%.
For this reason any direct use of portfolio optimization methods of the Markowitz
kind shall imply unreasonable allocations: too much sampling error in the estimation
of the expected return and, due to this, an asset allocation which shall be idiosyncratic
to the specific estimation sample.
This is in fact what we observe in practice: from the direct use of historical data
we derive allocation weights wildly varying with time, completely different across stocks
and, often, unreasonably extreme. The ex post (i.e. derived from historical estimates)
optimal portfolio seems to be highly specific to the time of estimation. Accordingly,
the optimal allocation with the Markowitz model changes a lot from time to time while
the market allocation, that is, the relative total capitalization of traded assets, is neither
extreme nor so wildly time varying.
A possible solution is: do not bother with optimization and use some standard
allocation rule. This could be simply the market allocation, as approximated by some
wide scope index, or some equally weighted portfolio or any other reasonable choice.
Obviously, due to the same problem described above, the expected value of any allocation containing a sizable stock proportion shall be difficult to estimate. Notice,
moreover, that any problem in estimating the expected value has two faces: first, it is
difficult to estimate the parameter; second, and most important, even if the
parameter were known, it would be very difficult to assess which average return you shall get
out of your investment over a reasonable period of years.
For this second problem we can do nothing. For the first problem some useful
suggestions were made during the nineties. The most widely implemented suggestion is
the so called Black and Litterman model.
The basic idea is that of using the market allocation as a starting point and expressing
our views as departures from this allocation. The main contribution of the method is
to discipline the asset manager's action. The asset manager is required to numerically
specify the extent and the confidence of his/her views, and the effect of this specification can be immediately checked in terms of departures from a reasonable (the
market's) allocation. This helps in avoiding what often can be seen in practice: wrong
implementation of reasonable views.
Suppose that the market allocation is observable, at least for aggregate asset classes,
and suppose that this allocation is a mean variance optimal allocation. This is obviously a rather strong hypothesis; however, the portfolio of the market (not the CAPM
market portfolio) should at least be reasonable, that is, rationally held, and a minimal
requirement for this should be not too much mean variance inefficiency. Then, if we
make an hypothesis on the variance covariance matrix of asset returns (typically a data
based estimate) and on the market risk aversion parameter, we can invert the vector of
market weights in order to find the implied vector of market expectations. In formulas, since the tangency portfolio weights, renormalized in order to have a total sum of
weights given by 1, are:
wmkt = λΣ⁻¹(µmkt − 1rf)
we have:

µmkt = Σwmkt/λ + 1rf

This procedure is equivalent to implied volatility computation using quoted option
prices and the Black and Scholes model. In both cases the result would be irrelevant
were it not used for some further modeling: in the case of implied volatility,
the market prices of options being known, the estimate is useful for computing other
derivative prices or for evaluating hedges; in the case of Black and Litterman the
market portfolio is required in order to compute the market expectations vector so,
in the absence of constraints or other information the investor should reasonably hold the
market portfolio itself. In this case the knowledge of the market implied expectations
would be irrelevant. On the contrary, the idea of Black and Litterman is to combine
the market implied expectations with further information (and, possibly, constraints)
in order to derive a strategic allocation that can differ from that of the market.68
In Black and Litterman the private and market information of the asset manager
are expressed as a distribution for the vector of expected returns µR, which is supposed
to be a random vector. The private information is expressed by a set of views. In other
words, the asset manager assesses q expected values and variances of a set of q linear
combinations (portfolio returns) of µR. On the other hand the market information is
expressed by assessing that the expected returns of the assets in the market portfolio
are the random vector µR with expected value equal to the market implied expected
return and varcov matrix equal to a fraction of the observed returns varcov matrix.
In formulas: let R be the k × 1 random vector of market returns with expected
values µR, Σ its varcov matrix, and P the q × k matrix for the weights of the q portfolios
on which the asset manager expresses views.
The market information is summarized by the hypothesis:
µR ∼ Nk(µmkt; τΣ)
here τ is a scalar smaller than one.
The asset manager's information is expressed by the hypothesis:

PµR ∼ Nq(V; Γ)
where V is a vector of expected values for the view portfolios and Γ a diagonal varcov
matrix expressing the confidence in these views.
We must now derive an expectation vector, say µBL, combining both market and
private information, to be used for portfolio optimization.
There are different possible ways of deriving the Black and Litterman formula for
the µBL vector. A simple way is to solve the following optimization problem:
µBL = arg min_µ (µ − µmkt)'(τΣ)⁻¹(µ − µmkt) + (Pµ − V)'Γ⁻¹(Pµ − V)
The interpretation of this problem is simple: we want to find the value of µ which
minimizes the distance (weighted with the inverse of covariance matrices) with respect
to the market implied value but also respects as well as possible the views of the fund
68 The reader shall notice some lack of coherence in this attitude. The use of the market portfolio
and the hypothesis that this portfolio is derived by mean variance optimization imply some degree
of uniform market information. On the other hand, we suppose also that there exists some private
information which the asset manager can pool with the market implied expectation. A possible
justification of this, albeit quite informal, lies in the fact that the asset manager's portfolio size could
be negligible with respect to the market.
manager. In more technical terms we reduce our problem to a weighted least squares
problem. In the limit, when the diagonal elements of Γ go to zero, that is, when there
is infinite confidence in the views, the problem becomes equivalent to a constrained
least squares problem. On the other hand, when Γ has diagonal elements diverging to
infinity (no confidence in the views), the solution to the problem is simply µ = µmkt .
In order to better understand the way in which the two terms in the objective
function are weighed it is useful to rewrite the objective function as:
µBL = arg min_µ [a/(a + b)] (µ − µmkt)'A⁻¹(µ − µmkt) + [b/(a + b)] (Pµ − V)'B⁻¹(Pµ − V)
where A and B are "numeraire" varcov matrices, that is, the τΣ and Γ varcov matrices
whose terms have been divided by the sum of the terms on the diagonal. In other words,
these matrices express variances in relative terms. a and b are, respectively, the diagonal
sums of the terms in (τΣ)⁻¹ and Γ⁻¹. Using these definitions it is easy to see that the
weight of each of the two terms in the quadratic form is given by the relative size of
a and b respectively.
In order to find µBL we begin by taking the first derivative of the objective function
w.r.t. µ:

∂L/∂µ = 2(τΣ)⁻¹(µ − µmkt) + 2P'Γ⁻¹(Pµ − V)
we set this to 0:
(τΣ)⁻¹(µBL − µmkt) + P'Γ⁻¹(PµBL − V) = 0

((τΣ)⁻¹ + P'Γ⁻¹P)µBL − ((τΣ)⁻¹µmkt + P'Γ⁻¹V) = 0

µBL = ((τΣ)⁻¹ + P'Γ⁻¹P)⁻¹((τΣ)⁻¹µmkt + P'Γ⁻¹V)
This last is the celebrated Black and Litterman formula and gives the value of the
vector of expected returns which optimally (in the distance sense considered above)
pools the market and the fund manager's opinions.
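The formula takes a few lines of NumPy. In the sketch below all numbers (Σ, τ, µmkt and the single relative view) are made up for illustration, and the result is cross-checked against the equivalent "spread" form µmkt + K(V − Pµmkt) discussed below.

```python
import numpy as np

# Hypothetical two-asset example of the Black and Litterman formula.
Sigma  = np.array([[0.040, 0.012],
                   [0.012, 0.090]])
tau    = 1 / 3
mu_mkt = np.array([0.05, 0.07])

# One relative view: asset 2 outperforms asset 1 by 4%, with view variance 0.001.
P     = np.array([[-1.0, 1.0]])
V_    = np.array([0.04])
Gamma = np.array([[0.001]])

tS_inv = np.linalg.inv(tau * Sigma)
G_inv  = np.linalg.inv(Gamma)

# mu_BL = ((tau Sigma)^-1 + P' Gamma^-1 P)^-1 ((tau Sigma)^-1 mu_mkt + P' Gamma^-1 V)
mu_BL = np.linalg.solve(tS_inv + P.T @ G_inv @ P,
                        tS_inv @ mu_mkt + P.T @ G_inv @ V_)

# Cross-check with the equivalent "spread" form mu_mkt + K (V - P mu_mkt):
K = tau * Sigma @ P.T @ np.linalg.inv(P @ (tau * Sigma) @ P.T + Gamma)
mu_BL_2 = mu_mkt + (K @ (V_ - P @ mu_mkt)).ravel()
print(np.allclose(mu_BL, mu_BL_2))  # True
```

The agreement of the two computations is exactly the matrix inversion lemma identity derived in the footnote.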
The Black and Litterman formula can be written in a slightly different and interesting way by exploiting the "matrix inversion lemma" or by direct, tedious computation.
The result is69:

µBL = µmkt + (τΣ)P'(PτΣP' + Γ)⁻¹(V − Pµmkt) = µmkt + K(V − Pµmkt)
69 The tedious computations are as follows:

µBL = ((τΣ)⁻¹ + P'Γ⁻¹P)⁻¹((τΣ)⁻¹µmkt + P'Γ⁻¹V) =
= ((τΣ)⁻¹ + P'Γ⁻¹P)⁻¹(τΣ)⁻¹τΣ((τΣ)⁻¹µmkt + P'Γ⁻¹V) =
(remember now that A⁻¹B⁻¹ = (BA)⁻¹)
= (I + τΣP'Γ⁻¹P)⁻¹(µmkt + τΣP'Γ⁻¹V) =
While the first formula expresses µBL as a weighted average of the market expected
values vector and of the fund manager's opinions, this second formula implies that the
tangency portfolio which takes into account both market and private information is
given by the market portfolio plus a "spread position". In fact, using the usual formula
for the optimal portfolio we have:
wBL = gλ(Σ⁻¹(µmkt − 1rf) + Σ⁻¹K(V − Pµmkt))

where g = 1/[λ(1'Σ⁻¹(µmkt − 1rf) + 1'Σ⁻¹K(V − Pµmkt))] is a constant which makes
the sum of the weights of the portfolio equal to 1.
This is the algebra of the Black and Litterman model. Much more relevant than
this is the choice of the inputs, which strongly influence the final allocation.
12.0.1 The market portfolio
In principle the market portfolio should consider the total value of each traded asset,
that is, the financial wealth of all agents. To leave out assets or to aggregate assets in
portfolios implies an incorrect evaluation of µmkt. In practice it is obviously impossible
to use all marketable assets: just as an instance, this would require the estimation of
a huge variance covariance matrix with thousands of rows, an impossible task.
The current practice is that of either choosing a detailed analysis (asset by asset) of
a single market or market subset, or using the Black and Litterman model for strategic
allocations among aggregated asset classes. Much could be said pro and con each
choice. The only point we want to stress here is that the market expected return for
each asset derived by the Black and Litterman model strongly depends on the choice
of the market portfolio proxy. In other words: the market expected return for the same
asset or asset class shall differ if this asset or asset class is introduced in different
proxies of the market portfolio. This could imply some degree of incoherence if the
model is used for helping in the allocation of many, partially overlapping, portfolios.
= (I + τΣP'Γ⁻¹P)⁻¹(µmkt + τΣP'Γ⁻¹V + τΣP'Γ⁻¹Pµmkt − τΣP'Γ⁻¹Pµmkt) =
= (I + τΣP'Γ⁻¹P)⁻¹((I + τΣP'Γ⁻¹P)µmkt + τΣP'Γ⁻¹(V − Pµmkt)) =
= µmkt + (I + τΣP'Γ⁻¹P)⁻¹τΣP'Γ⁻¹(V − Pµmkt) =
= µmkt + (I + τΣP'Γ⁻¹P)⁻¹τΣP'Γ⁻¹(Γ + PτΣP')(Γ + PτΣP')⁻¹(V − Pµmkt) =
= µmkt + (I + τΣP'Γ⁻¹P)⁻¹(τΣP' + τΣP'Γ⁻¹PτΣP')(Γ + PτΣP')⁻¹(V − Pµmkt) =
= µmkt + (I + τΣP'Γ⁻¹P)⁻¹(I + τΣP'Γ⁻¹P)τΣP'(Γ + PτΣP')⁻¹(V − Pµmkt) =
= µmkt + τΣP'(Γ + PτΣP')⁻¹(V − Pµmkt)
Notice the difference with the Markowitz result: in the Markowitz model we find
the “best” mean variance portfolio for a given set of assets. It is not surprising, and
completely coherent, that the weight of a given asset changes if the other assets in the
portfolio to be optimized change. However, the market portfolio, under the CAPM
model, is a mean variance portfolio only if taken as a whole. Sub portfolios of the
market portfolio are not, in general, mean variance portfolios. Hence, the use of sub
portfolios of the market portfolio for deriving from these the market expectation vector
as if they were optimal mean variance portfolios could lead to a biased evaluation
(question: when is it the case that a sub portfolio of a mean variance optimal portfolio
is still mean variance optimal?).
A possible way out of this problem is a deeper use of the CAPM theory. We know
that the CAPM basic equation is:

E(ri) − rf = λβi

where ri is the return for the i-th asset, βi = cov(rm − rf; ri − rf)/Var(rm − rf), rm is
the return of the market portfolio and λ (not to be confused with, but related to, the
parameter with the same name in the Black and Litterman model) is the market price
of risk in units of the market portfolio excess expected return, that is λ = E(rm − rf).
This result suggests the following procedure: choose a large cap weighted index as the
market portfolio, estimate the betas only for those assets involved in your asset allocation, choose
a value for λ and compute the expected excess return for each asset involved in the
asset allocation. Following this procedure it is possible to evaluate the market implied
excess expected returns of the assets of interest in a coherent way, in the sense that the
valuation only depends on the market index used as a proxy of the market portfolio and
does not require computing the variances and covariances of all the assets in the index.
We could then use a very exhaustive index and still maintain the numerical manageability
of our problem.
It should be said that this procedure, while useful, tends to give results which are
more unstable than the standard Black and Litterman procedure, in particular when
we need to evaluate the market expected excess return for assets with low correlation
with the market portfolio. In this case a possible alternative is the use of
multifactor models.
12.0.2 The estimation of Σ.
The variance covariance matrix of returns, Σ, is, in general, estimated from the data.
The typical estimate of Σ is the standard statistical estimate based on three or more
years of monthly or, at most, weekly data. Sometimes a smoothed estimate analogous
to the one we described for the variance is used.
A characteristic problem in the estimation of the variance covariance matrix which
has a huge influence on portfolio optimization is the overestimation of covariances. This
may have dire consequences. Suppose you are considering just two assets for your
portfolio. Their expected values are very similar and so are their variances. Suppose
now their correlation is not extreme; in this case, even if the expected values are
not the same, the optimal investment shall share its weights on both assets. Suppose,
instead, that the correlation is near one. In this case there is no advantage in diversification and the weight shall concentrate almost completely on the asset with the highest
expected value.
Overestimated correlations, joined with bad estimates of expected values, are one of
the sources of the observed instability in optimal mean variance allocations. Obviously,
if the correlations really are high, there is no problem in allocating the full weight to
the highest expected value asset. On the other hand, due to the similarity between
expected returns, a shared allocation would not be badly suboptimal. Suppose, on the
contrary, that the covariance was overestimated. In this case concentrating the full
weight on one of two assets which are very similar in expectation terms would be a bad
mistake: we would lose the chance of decreasing the overall variance thanks to diversification.
Working on this intuition we could suggest a simple "shrunk" version of the standard estimate of the variance covariance matrix which shrinks it toward an average
covariance. Start with the standard estimate of the variance covariance matrix and
derive from this the estimate of the correlation matrix and the estimate of the vector
of variances. Let Ω be the estimated correlation matrix. From this derive the shrunk
estimate αΩ + (1 − α)Θ where Θ is a reference correlation matrix, typically a matrix
with ones on the diagonal and the average of the off diagonal elements of Ω as correlation terms (more complex structures can be considered when we are analyzing data
from more than one market). The typical α shall be a number in the range of .8-.9.
The resulting estimate shall be converted into a covariance matrix by composing it
with the estimates of the variances and shall show less extreme covariances than the
original estimate.
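The recipe just described can be sketched as a small Python function (the covariance matrix used to exercise it contains made-up numbers, with a suspiciously high correlation between the first two assets):

```python
import numpy as np

def shrink_covariance(S, alpha=0.85):
    """Shrink the correlations implied by the covariance estimate S toward
    their average off-diagonal value, then recompose with the variances."""
    sd = np.sqrt(np.diag(S))
    Omega = S / np.outer(sd, sd)            # estimated correlation matrix
    m = Omega.shape[0]
    off = ~np.eye(m, dtype=bool)
    Theta = np.eye(m)
    Theta[off] = Omega[off].mean()          # average off-diagonal correlation
    shrunk = alpha * Omega + (1 - alpha) * Theta
    return shrunk * np.outer(sd, sd)        # back to covariance units

# A hypothetical estimate with one very high correlation pair.
S = np.array([[0.0400, 0.0190, 0.0020],
              [0.0190, 0.0100, 0.0015],
              [0.0020, 0.0015, 0.0250]])
S_shrunk = shrink_covariance(S, alpha=0.8)
```

By construction the variances are untouched (the diagonal of the shrunk correlation matrix is still made of ones) while the extreme covariances are pulled toward the average.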
12.0.3 The risk aversion parameter λ
If the market portfolio does not contain the riskless asset and if its weights sum to 1
then, as seen in the Markowitz section, we have:
λ = V(rm)/(E(rm) − rf)
In principle this parameter could be estimated using a very long time series for an
exhaustive market index. In practice, for the usual reasons, only the numerator of λ
can be estimated in this way; for the denominator we shall need some a priori guess.
12.0.4 The views
Where do the views come from? Wish I knew. All I can tell you is how to express your
views, if any (otherwise the use of a market portfolio is never a bad choice).
In Black and Litterman a view is a specification of the expected return and variance
over a given period of time of a portfolio. There exist, broadly, two types of views:
absolute and relative. An absolute view is a view on a portfolio where the net position
is long or short, a relative view is a view on a portfolio whose sum of weights is zero.
Algebraically, the weights of the portfolios on which the fund manager expresses views
are written in the rows of a matrix P: each row represents the weights of a different
portfolio and each column the weights for the same asset in different portfolios. The
expected value for each view is written in the column vector V = E(PµR) and the
variance for each view in the diagonal elements of the matrix Γ. The correlation
between different views is supposed to be 0. Notice that this does not imply that the
correlation between the returns of different view portfolios is zero but only that the
correlation between the elements of PµR is zero.
In order to specify the views the fund manager could simply specify the extremes
A, B of an interval in which each row of PµR is believed to fall with probability of about
95%. With this information the expected value of the j-th view could be estimated as
vj = (Aj + Bj)/2 and the standard deviation as √γj = (vj − Aj)/2.
Notice that from the formula:

µBL = µmkt + (τΣ)P'(PτΣP' + Γ)⁻¹(V − Pµmkt) = µmkt + K(V − Pµmkt)

we see that the effect of a view depends on the difference between the view expected
value V and the market valuation of the same: Pµmkt. For instance, suppose that
a view only involves two assets and the view portfolio is short one asset and long the
other with identical weights. Suppose that the view expected value is positive. This
does not in general imply that, by the effect of this view, the resulting µBL shall display
a difference between the expected values of the two assets bigger than that to be found
in µmkt. This shall be true only if the difference between the expected values of the
two assets in µmkt is smaller than that hypothesized in V. It is then very useful, during
the process of view specification, to always compare V with Pµmkt.
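The interval-based elicitation of the views can be sketched in a few lines (the interval endpoints below are made-up numbers for two views):

```python
import numpy as np

# Hypothetical intervals: each row of P mu_R believed to fall in [A_j, B_j]
# with probability of about 95% (two views, made-up endpoints).
A = np.array([0.00, -0.02])
B = np.array([0.08,  0.06])

v = (A + B) / 2               # view expected values: the vector V
sd = (v - A) / 2              # standard deviations: the half-width is 2 sd
Gamma = np.diag(sd ** 2)      # Gamma carries the view variances on its diagonal
print(v, sd)
```

The construction uses the usual normal approximation, where a 95% interval spans about two standard deviations on each side of the mean.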
12.0.5 The choice of τ
Here we have very little to say. As mentioned above, the choice of τ is relevant only
relative to the choice of the elements in Γ. In fact, if we multiply both τ and Γ by the
same scalar the Black and Litterman formula does not change. The practical meaning
of τ is that of transforming Σ, the estimate of the varcov matrix of the return vector R,
into an estimate of the varcov matrix of µR, the expected value of the return vector.
Typical values chosen in the available examples are of the order of 1/3, meaning that
1/3 of the variances of the elements of R is due to a random variation of µR.
12.0.6 Further readings
Lots of applied papers were written about the Black and Litterman model and its
extensions. Excel and Matlab implementations also abound. A possible reference
point is the webpage http://www.blacklitterman.org/ by Jay Walters.
Examples
Exercise 12 - Black and Litterman.xls
13 Appendix: Probabilities as prices for betting
In this appendix we give a very small hint of three "formidable" topics: the definition
of probability in betting systems; the connection between probability and frequency;
and the big question concerning whether actual people's decisions can be described with
probability models. It is obviously impossible to deal with these topics in such a small
space but it is in any case useful to point them out for future study.
13.1 Betting systems
Here we recall the betting system definition of the probability P(A) of an event A as
the price to pay for buying a ticket which pays 1 if A happens and 0 if
A does not happen. If we suppose a set of such bets is made on a class of events and
any finite combination of events can be bet on or against (in this second case P(A) is
the price received for selling the ticket), then avoidance of arbitrage (the existence of
combinations of bets which guarantee a win with no cost, what is known as a "dutch
book") implies P must satisfy the properties traditionally assumed for a probability
(assigned to a finite set of events).
For instance, suppose you are betting on the event A and the price you pay for this
bet is P(A). This must be between 0 and 1. In fact, suppose P(A) > 1; then I accept
your bet and, in case A is true, I give you 1 with a net gain of P(A) − 1 > 0 while,
if A turns out to be false, I keep your P(A). In both cases I am left with a positive
gain. This is an arbitrage or, in betting language, a "dutch book". Suppose now you
are betting on both A and Ā and the prices you are willing to pay for these bets, P(A)
and P(Ā), do not sum to 1. Suppose, for instance, they sum to less than 1. Then, since
at these prices you are both receiving and making bets, I pay you those prices and bet
on both events. Since either A or Ā is going to be true, I am getting 1 in any case, and
I paid less than 1 for it.
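The arithmetic of this dutch book can be made explicit in a couple of lines (the prices are made-up numbers):

```python
# If the prices for betting on A and on its complement sum to less than 1,
# the counterparty pays both prices, wins exactly one bet, and locks in a gain.
p_A, p_comp = 0.55, 0.35      # hypothetical prices you are willing to accept
cost = p_A + p_comp           # I pay 0.90 in total and bet on both events
payoff = 1.0                  # exactly one of A, not-A occurs, so I receive 1
sure_gain = payoff - cost     # positive whatever happens: a dutch book
print(round(sure_gain, 2))    # 0.1
```

The symmetric argument (prices summing to more than 1) works by selling both tickets instead of buying them.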
A set of such no arbitrage prices is called a set of "coherent" probabilities and we can
prove in a very simple way that, if the set of events we bet on constitutes an algebra,
all properties required of probabilities in an abstract definition of probability must be
satisfied. There is a single exception: countable additivity cannot be justified in this
setting.
This idea of no arbitrage in betting systems is an exact analogue of the modern
theory of no arbitrage in financial markets which, however, is a little more general.
First: in markets you have time, and time has a value which is usually positive
(the interest rate). If the price for the bet is paid today, while the bet is settled in the
future, the bettor shall require the price of the bet to be discounted with the value
of time so that, for instance, in the case of the bets on A and Ā the "spot" prices of the
bets shall sum to something less than 1. If we pay and settle the bet on the same future
day, the prices for the bets shall require no time value (they are now forward prices)
and they'll sum to 1.
Second: with simple bets the two possible values for the payoff are 0 or 1. This
may be the case for the payoff of some security (as, for instance, digital options); in
the general case, however, future values of securities in financial markets may be, in
principle, generic real numbers.
This kind of problem has been taken into account by probabilists for at least two centuries, and the extension of the betting definition to this setting is as follows: suppose you are considering bets i = 1, ..., m on m quantities whose future values shall be Xi . You pay the price (or receive the price) P (Xi ) for making (receiving) the bet whose payoff is going to be Xi .
If you are willing to bet or receive bets of this kind and avoid arbitrage, it is possible to show that P (Xi ) must satisfy the properties we usually require of the mathematical expectation (at least in the discrete or bounded X case).
As is well known, the expected value of a 0, 1 valued random variable is the probability with which the random variable takes the value 1, and we can represent any event A with such a variable (called the indicator function of A). So in this particular case the two settings coincide.
13.2 Probability and frequency
Now, a point of interest. While all this is quite intuitive, it obviously tells us nothing about the values we should choose for these P , provided no arbitrage is involved.
Can we say something more about this?
Here things become very fuzzy.
For instance: we are used to thinking that probability has something to do with frequency. What is the connection of this definition with frequency?
Attempts at connecting a definition of probability to limits of frequencies have been made, but they are difficult to justify; after all, limits are very metaphysical objects.
This notwithstanding, it is a widespread notion that any definition of probability should yield a probability calculus whose algebra has to be valid when applied to frequencies of events in finite replications of experiments. This stems from the requirement that probability models should be, in appropriate situations where repeated experiments are possible, useful in describing possible future frequencies.
The great mathematician F. P. Ramsey, in a path-breaking essay (“Truth and Probability”, 1926), states this point in an admirably clear way:
“I suggest that we introduce as a law of psychology that [the subject’s]
behaviour is governed by what is called the mathematical expectation; that
is to say that, if p is a proposition about which he is doubtful, any goods
or bads for whose realization p is in his view a necessary and sufficient
condition enter into his calculations multiplied by the same fraction, which
is called the “degree of his belief in p”. We thus define degree of belief in a
way which presupposes the use of the mathematical expectation. We can
put this in a different way. Suppose his degree of belief in p is m/n; then
his action is such as he would choose it to be if he had to repeat it exactly
n times, in m of which p was true, and in the others false.... This can also
be taken as a definition of degree of belief, and can easily be seen to be
equivalent to the previous definition.”
Here “the previous definition” is Ramsey’s version of the no-arbitrage-in-betting-systems definition of (subjective or personal) probability.
So, according to Ramsey, probability and relative frequencies should (and do) share the same mathematical rules because the personal probability of an event assessed by a decision maker can be interpreted as a forecast of the relative frequency with which the event shall be true in a given set of hypothetical future experiments, in the sense that the decision maker’s action, taken on the basis of such a probability, “is such as he would choose it to be if he had to repeat it exactly n times, in m of which p was true, and in the others false”.
A great Italian mathematician, Bruno deFinetti, gave a proof of a result which, in
a very particular case, connects probabilities and frequencies.
Suppose you have a sequence of 0, 1 valued random variables and suppose that, for each choice of n of these, the probability you give to this sequence depends only on the number of zeros and ones in the sequence (in technical words, you say that these random variables are exchangeable). Then, the bigger n is, the nearer your evaluation of the probability for a sequence containing any given number of 0s and 1s should be to the product of the probabilities for each element in the sequence. Moreover, the bigger n is, the nearer the probability of, say, 1 that you use for a 1 in that given sequence should be to the relative frequency of 1s in that sequence. In other words, exchangeability implies approximate independence, conditional on the relative frequency of 0s and 1s, with an approximate value of the probability of 0 or 1 equal to the frequency of 0 or 1 in the given sequence.
This result tells us that, in some very specific but relevant setting, a probability statement must converge to a relative frequency, so that, in such a setting, there exists a connection between values of probabilities and values of frequencies, and not only between the rules of probability and the rules of frequency.
13.3 Probability and behaviour
A last, formidable topic.
We use probability models for describing games and markets.
In our probability models we require no arbitrage. Do gamblers/investors behave
in a way which agrees with this definition of probability?
At the time of Ramsey, it was already known that the actual behaviour of decision makers in gambling situations did NOT satisfy the basic axioms of probability (and this is true when we study investor behaviour too). The standard algebra we use for computing probabilities should then be considered a NORMATIVE description of decision making under uncertainty, not a POSITIVE description of the behaviour of real world market agents, obfuscated and motivated by animal spirits.
In the recent past a lot of research effort has been spent in trying to formalize real world, irrational (that is: arbitrage ridden and incompatible with relative frequencies) decision making behaviour using things like sub additive probabilities (P (A) + P (Ā) ≤ 1). This choice, obviously, makes probability a poor model for frequencies (which by definition add up to 1). However, the judgment on the possibility and usefulness of a clever mathematical description of irrational behaviour is better left to the Reader, together with the much more practically relevant question of the opportunity, or even possibility, of behaving rationally when others do not.
14 Appendix: Numbers and Maths in Economics
There is an age old debate on the use of Maths in Economics, dating back at least to the 17th century. The interested student could read, for instance, the introductory papers in the first issue of Econometrica (1933) and see an interesting trace of how this debate was framed by the soon to be “winner” side.
In particular it shall be useful to read first the position of Joseph Schumpeter and then the position of Ragnar Frisch.
Schumpeter writes:
There is, however, one sense in which Economics is the most quantitative, not only of ’social’ or ’moral’ sciences, but of all sciences, Physics not excluded. For mass, velocity, current, and the like can undoubtedly be measured, but in order to do so we must always invent a distinct process of measurement. This must be done before we can deal with these phenomena numerically. Some of the most fundamental economic facts, on the contrary, already present themselves to our observation as quantities made numerical by life itself. They carry meaning only by virtue of their numerical character. There would be movement even if we were unable to turn it into measurable quantity, but there cannot be prices independent of the numerical expression of every one of them, and of definite numerical relations among all of them.
Here it seems that the simple fact that economic data are mostly recorded as numbers implies that there must exist some relevant “mathematical” structure behind them. This is opposed to the supposedly more artificial introduction of numbers in Physics through the invention of processes of measurement. Obviously things are not so simple, and the opposite view is more likely to be the correct one: even if some phenomenon is recorded as a quantity, it could be the case that no simple mathematical structure can be applied to such quantities in order to reveal some relevant structure of the phenomenon. In the field of Finance, for instance, the fact that we observe a time series of prices for a given asset does not by itself mean that, say, we can apply time series analysis to these data. For this to be possible it is neither necessary nor sufficient that the data be expressed as numbers. What is relevant is that, after some translation of the data into numbers (but this is by no means necessary), we can state that, for some cogent reason, maybe based on the description of the system which generates them, these numbers satisfy some probabilistic model of the class suggested by time series analysis.
The introduction of quantities, when relevant because they satisfy some algebra as, for instance, in Physics, is usually not the beginning but the end of a long conceptual evolution striving to assess in some useful way aspects of real world phenomena (involving, say, the movement of bodies) which satisfy some simple rules that can be dealt with by mathematical models. In other words: quantities are defined through mathematical models, but quantities by themselves do not imply mathematical models. The wonderful feat of abstraction and modeling which resulted in the concept of inertial mass, and in its central role in successfully modeling mathematically the kinematic properties of completely different objects (which could have been, and actually were, “measured” in many other totally unproductive ways), could be illuminating reading for any student.
And here we quote the Editor’s Note by Ragnar Frisch (surely more quantitatively minded than Schumpeter) in the same issue of the journal. In it, the reasons for, and the problems of, the use of Mathematics (and, most important for this Author, Statistics) are clearly delineated. Mathematics is seen as a tool for speaking with rigor and for deriving testable hypotheses in order to analyze these with the use of Statistics, but:
“Mathematics is certainly not a magic procedure which in itself can solve
the riddles of modern economic life, as is believed by some enthusiasts. But,
when combined with a thorough understanding of the economic significance
of the phenomena, it is an extremely helpful tool.”
The debate goes on and, as is frequently the case, its “going on”, not its arguable solution, is the source of its usefulness.
15 Appendix: Optimal Portfolio Theory, who invented it?
A theory of optimal reinsurance which, as a particular case, includes optimal mean variance portfolio analysis was described in detail by Bruno deFinetti in a very famous (in Europe and among actuaries) prize winning paper, “Il problema dei pieni” (1940), Giornale dell’Istituto Italiano degli Attuari. This was 12 years before the papers of Roy and Markowitz.
On several occasions Italian academicians tried to point out deFinetti’s priority with respect to portfolio theory, both before and after Harry Markowitz’s Nobel award of 1990. This was to no avail until another big name of American financial academia, Mark Rubinstein, was interested in the case by a small group of Italian researchers. Thanks to Mark Rubinstein, Markowitz agreed to give his opinion on the topic and he did so in: Harry Markowitz, “deFinetti scoops Markowitz”, Journal of Investment Management, Vol. 4, No. 3, Third Quarter 2006.
Mark Rubinstein added a preface to the paper where he acknowledged a number of priorities to deFinetti, in the quoted and other papers, among which: early work on martingales (1939), mean-variance portfolio theory (1940), portfolio variance as a sum of covariances, the concept of mean-variance efficiency, normality of returns, implications of “fat tails”, bounds on negative correlation coefficients, an early version of the critical line algorithm, the notion of absolute risk aversion (1952), early work on optimal dividend policy (1957), and early work on Samuelson’s consumption loan model of interest rates (1956).
In his paper Markowitz recognizes deFinetti’s priority. However, he devotes most of the paper to criticizing a marginal point in the algorithm deFinetti suggests for finding the optimal portfolio. A marginal point both because deFinetti’s imprecision can easily be corrected in general, while it is already perfectly correct for any “sensible” case, and because, while Markowitz suggested in 1956 an algorithm for solving the portfolio optimization quadratic programming problem, he was not the inventor of quadratic programming, nor was his contribution to portfolio optimization ever gauged on the basis of this algorithm. In fact the 1952 paper contains no algorithm at all, but it still contains all the basic ideas of optimal portfolio theory.
Scientists are people, with all the standard weaknesses. So, it is perhaps ungenerous to recall the scathing answer given by Markowitz to Peter Bernstein, who was worried, after reading Rubinstein’s introduction and Markowitz’s paper, about the status of one of his “heroes of the theory of Finance” (Capital Ideas Evolving, 2007, p. 109): “When I asked Markowitz what he would have done if someone had shown him the deFinetti paper while he was working on his thesis, his response was unqualified: ’I would have seen at once that deFinetti was related to my portfolio selection work, but by no means identical to it. I guess I would have given him a footnote in my paper’”.
This seems to be even less than what Markowitz admitted in his 2006 paper. Indeed, if one reads the 1952 Markowitz paper with deFinetti’s work in mind, the meaning of “by no means identical” becomes clear: Markowitz’s paper is a very particular case of deFinetti’s general approach. The question could then be: which should be the paper and which the footnote...
Bernstein’s comment on this answer is quite cryptic: so naive it could be understood as intentionally naive: “This answer was a great relief to me. As it should be to all who appreciate the value of Capital Ideas to the world of investing. Markowitz’s work on portfolio selection was the foundation of all that followed in the theory of Finance, and of the Capital Asset Pricing Model in particular”.
Alas, I was unable to ask Bernstein about the true meaning of this sentence as he
died in 2009 before I read his book.
16 Appendix: Gauss Markoff theorem
The Gauss-Markoff theorem is a nice case study in the history of scientific attribution.
It is generally difficult and sometimes vain to assess scientific priorities. New ideas are quite infrequently “new”: they typically ripen out of a rich story of attempts and crystallize into new views only in the academically regulated opinion of posterity. However, tracing some concept back to its fuzzy origin is always an interesting and often an illuminating endeavor. Students of Finance should perhaps look into the origin of option pricing and the Arrow state price density in Vinzenz Bronzin’s work of 1908 (see footnote 70), put call parity in De La Vega (1688), some three hundred years before the “official” discovery by Stoll (1969, Journal of Finance), optimal portfolio theory (and much else), as already mentioned, in deFinetti (1940), and so on.
In the same vein the least squares method, having to do with Pythagoras’ theorem, is quite old but, in its modern form and applied to the interpolation of noisy data, it is sometimes credited to the young Gauss (1795). However, there exists a manuscript dated from the 1770s by the great Italian mathematician Lagrange (born Lagrangia, the name changed into Lagrange in France, where he spent most of his active life). On the other hand, Lagrange’s memoir was probably inspired by a work of Simpson (1757).
These and related works have to do with the “theory of errors”: they considered the best way of putting together observations subject to what we could today call “random error” in order to better “estimate” (again a modern word) one or more unknown quantities. This is one of the origins of the concept of the (in general weighted) arithmetic average. The other, more ancient, origin, from which comes the name “average” itself, is quite peculiar: the word has to do with “avaria”, an Italian term coming from Arabic, and comes from a rule for distributing, across all merchants whose goods are the freight of a ship, the losses due to jettisoning cargo in order to save the ship.

[70] See “Vinzenz Bronzin’s option pricing models, exposition and appraisal”, Springer 2009.
As we know, the arithmetic mean of n numbers minimizes the sum of squares of
the differences between the given numbers and the mean itself. So, it is a least squares
estimate of a constant.
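The claim that the arithmetic mean minimizes the sum of squared deviations is easy to check numerically. The data below are my own toy example, not from the handout:

```python
# Quick numerical check (added here, not in the original notes) that the
# arithmetic mean minimizes the sum of squared deviations from the data.
data = [2.0, 3.0, 7.0, 10.0]
mean = sum(data) / len(data)              # 5.5

def sse(c):
    """Sum of squared differences between the data and a constant c."""
    return sum((x - c) ** 2 for x in data)

# The SSE at the mean beats the SSE at any other candidate constant.
for candidate in (4.0, 5.0, 6.0, 8.0):
    assert sse(mean) <= sse(candidate)
```

Any other constant, however close to the mean, gives a strictly larger sum of squares; in this sense the mean is the least squares estimate of a constant.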
A precise description of the method was given by Legendre (1805), who was probably the first to use its modern name, “least squares” (moindres carrés). Moreover, Legendre explicitly considered the application of the method to linear approximations. Gauss quoted Legendre in his astronomical work “Theoria motus” of 1809 and mentioned the fact that the method of Legendre had been used by himself in 1795. Legendre was quite upset by this and wrote to Gauss to complain about this appropriation.
Laplace published between 1812 and 1820 his “Théorie analytique des probabilités”, where least squares are given an important (if mathematically quite cumbersome) place. The renown of this book in the nineteenth century was very important in establishing the method and tended to overshadow Gauss’s contribution. This may seem strange, as Laplace himself wrote in his Théorie a short historical note on the theory of errors very similar to the one summarized here.
If there may be some controversy about the invention of least squares, there is no
controversy about the so called “Gauss-Markoff” theorem.
Neither Legendre nor Laplace or Lagrange or Simpson ever stated or gave a proof
of this theorem.
The Gauss-Markoff theorem has to do not with least squares as an algorithm but with conditions for the optimality of least squares. It deals, in other words, not with least squares as a tool for fitting “models” to given “data” but with the properties of least squares from a probabilistic point of view, that is, considering all potential samples. This is a paramount instance of a new attitude which was developing between the end of the 18th and the beginning of the 19th century, whereby Probability, originally developed from gambling problems, begins to be seen as a tool for justifying what today we would call “statistical inference”.
The nearest thing to Gauss’s theorem in Laplace’s work is a result about what we could today call “consistency” of the least squares estimate: again a probabilistic justification for a statistical method.
As far as I know, the proof of the “Gauss-Markoff” theorem was first given by Gauss in his celebrated “Theoria combinationis observationum erroribus minimis obnoxiae”, which we could translate as “Theory of the least affected by errors combination of observations” or “Theory of those combinations of observations which are least affected by errors” (1821). A work where, by the way, Gauss (re)introduces the Gaussian density (but does not require the errors to be Gaussian) and also gives a table for some of its quantiles (a curiosity: Gauss’s table does not contain the .975 quantile, the famous 1.96).
In this work it could be difficult to recognize the theorem at first reading, for at least two reasons. First: while the result is there, even in greater generality than what we are used to, the kind of mathematical language is quite different from the current one. Second: the result is not directly connected with the linear model (which is a special case) but with the general problem of “estimating” (our word) one or more unknown constants observed with error (the theory of errors mentioned above).
It is to be noticed that the case of unequal variances across errors is considered too (in fact this GLS approach is used throughout the work).
Gauss had already given a related result some years before, in 1809, in the astronomical work “Theoria motus”, but with a different argument which today we could describe as a mix of Bayesianism and maximum likelihood.
Why the name of Markoff?
The great Russian mathematician gives a proof of the theorem in Ch. 7 of his book “Исчисление вероятностей” (Calculus of Probabilities). This became available in an abridged German translation in 1912 (with a preface by the Author himself). On a direct reading, it is clear that Markoff’s chapter on least squares is a slightly more modern and detailed version of Gauss’s work (including the use of the Gaussian density). In fact, the first bibliographic entry at the end of the chapter, in the German version, is a German translation from 1887 of Gauss’s original Latin. This is not an addition but a slight alteration by the translators: the original Russian version quotes the same work by Gauss as translated into French by J. Bertrand in 1855.
Markoff’s work was quoted by J. Neyman (a Polish-American mathematical statistician, one of the creators of hypothesis testing and, more in general, of modern mathematical Statistics) in a discussion paper he presented at the Royal Statistical Society on June 19th, 1934: “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection”, Journal of the Royal Statistical Society, Vol. 97, No. 4 (1934), pp. 558-625. He names “Markoff method” (Note II, pp. 563, 593) a set of procedures inspired by the above quoted chapter in Markoff’s book. Neyman, p. 563, gives references for both the Russian and the German version of Markoff’s book. Neyman, however, seems to give Markov the merit of the result, or at least of “the clear statement of the problem. The subsequent theory is matter of easy algebra”. Neyman doubts, p. 564, that the method was known, due to the fact that it was published in Russian. Neyman seems to ignore Gauss’s work on the topic and does not quote Gauss. The right attribution to Gauss was immediately reinstated by R. A. Fisher in the discussion following the paper, p. 616.
The short historical note by Plackett (Biometrika, Vol. 36, No. 3/4 (Dec., 1949), pp. 458-460) could be interesting reading, as it sets things straight. In this note, moreover, we have one of the first instances where the result is phrased using matrix notation in a way similar to the one we use today. A more thorough historical paper with more details about our story is: “Studies in the History of Probability and Statistics. XV: The Historical Development of the Gauss Linear Model” by Hilary L. Seal (Biometrika, Vol. 54, No. 1/2 (Jun., 1967), pp. 1-24).
A last note about matrix notation. Modern matrix notation is relatively new in Statistics. One of the first and, arguably, the first famous instance of a modern matrix based presentation of least squares, and of GLS (but with no explicit version of the “Gauss-Markoff” theorem), is in a paper by Alexander Aitken: “On least squares and linear combination of observations”, Proc. Royal Soc. Edinburgh, 55, (1935), 42-48. The New Zealander mathematician, beyond many important results, also played an important role as an influential pioneer of the use of matrix calculus and matrix notation in Statistics.
Summary: the theorem, in its general form, is by Gauss; Markoff never claimed to be the author of the result and explicitly quotes Gauss; Markoff’s name was added to that of Gauss probably because of a wrong attribution by Neyman; this became common use notwithstanding the immediate correction by no less than R. A. Fisher.
17 Exercises and past exams
All past exams are available with solutions on Francesco Corielli webpage.
The following table is a cross reference of each exercise in each past exam, classified by argument. Since some exercises require topics from different chapters, the classification is only approximate. If you download the exam pdf files into the same directory as the handouts, a click on the date should open the proper file.
211
2005-12-23
2006-01-11
2006-02-09
2006-04-04
2007-01-24
2007-02-12
2007-09-11
2007-12-19
2008-01-23
2008-02-11
2008-09-12
2008-12-17
2009-01-21
2009-02-09
2009-09-11
2009-12-15
2010-01-20
2010-02-08
2010-09-10
2010-12-14
2011-01-26
2011-02-07
2011-09-09
2011-12-13
2012-01-18
2012-02-06
2012-09-07
2012-12-11
Subjects ⇒
Dates ⇓
212
4
46
26
2
1
1
5
678
68
7
6
56
7
13
17
17
1
1
1
7
1
7
2
Basic prob
and Stats
19
6
3
1
2
7
4
5
74
7
5
4
4
4
24
4
4
5
7
5
3
2
VaR
2
2
2
2
Variance
estimation
3
2
5
34
3
7
3
4
6
36
5
5
7
247
34
3
Risk premia, returns
and time diversification
1 2 2.1
1
Linear
Model
9
456
456
3456
5
48
356
3
1235
23
25
167
1367
234
235
1357
3
13
4
15
15
27
12
1
1467
5
2346
17
256
5
6
4
4
Style
Analysis
10
5
25
6
7
Factor
Models
11
78
6
17
6
5
12
6
7
1
3
5
3
2
1
6
6
5
4
4
78
8
6
6
Principal
Components
11.2
2
3
7
23
23
6
7
3
2
4
1
3
4
16
6
5
9
9
Black and
Litterman
12
2013-01-15
2013-02-06
2013-09-06
2013-12-10
2014-01-15
2014-02-05
2014-09-05
2014-12-09
2015-01-14
2015-02-04
2015-09-04
2015-12-09
2016-01-13
2016-02-04
2016-09-09
2016-12-12
2017-01-18
2017-02-02
2017-09-01
2017-12-19
2018-01-22
2018-09-03
2018-12-18
2019-01-21
Subjects ⇒
Dates ⇓
213
6
12
15
1
6
4
4
34
7
5
7
47
4
5
1
7
56
7
5
4
6
6
5
3
3
Variance
estimation
3
7
Basic prob
and Stats
19
2
5
1
5
5
4
4
1
3
5
5
3
2
2
67
5
4
3
3
2
1
5
2
VaR
7
7
34
3
36
7
17
1
6
3
7
5
7
6
7
Risk premia, returns
and time diversification
1 2 2.1
3
6
Linear
Model
9
1
145
4
13
245
347
4
56
1256
247
257
246
12
27
346
3457
15
256
47
12
124
27
12
12
1
6
6
3
7
1
4
Style
Analysis
10
6
6
7
Factor
Models
11
3
3
3
3
2
2
1
1
6
2
3
6
Principal
Components
11.2
6
27
2
5
6
5
1
6
3
4
4
3
Black and
Litterman
12
5
18 Appendix: Some matrix algebra

18.1 Definition of matrix
A matrix A is an n-row, m-column array of elements. The elements are indicated by ai,j , where the first index stands for the row and the second for the column. n and m are called the row and column dimensions (sometimes shortened to “the dimensions”) or sizes of the matrix A. Sometimes we write: A is an n × m matrix.
Sometimes a matrix is indicated as A ≡ {aij }.
When n = m we say the matrix is square.
When the matrix is square and aij = aji we say the matrix is symmetric.
When a matrix is made of just one row or one column it is called a row (column)
vector.
18.2 Matrix operations
1. Transpose: A′ = {aji }. A′′ = A. If A is symmetric then A′ = A.
2. Matrix sum. The sum of two matrices C = A + B is defined if and only if the dimensions of the two matrices are identical. In this case C has the same dimensions as A and B, and cij = aij + bij . Clearly A + B = B + A and (A + B)′ = A′ + B ′ .
3. Matrix product. The product C = AB of two matrices n × m and q × k is defined if and only if m = q. If this is the case, C is an n × k matrix and cij = Σl ail blj . In the matrix case it may well be that AB is defined but BA is not. An important property is C ′ = B ′ A′ or, which is the same, (AB)′ = B ′ A′ . Provided the products and sums involved in what follows are defined, we have (A + B)C = AC + BC.
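These rules are easy to check numerically. The two small matrices below are my own example, not part of the handout:

```python
# NumPy illustration (added, not in the original notes) of the
# transpose-of-a-product rule (AB)' = B'A'.
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 4.0]])          # 2 x 3
B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])               # 3 x 2

AB = A @ B                               # defined since inner sizes match
assert np.allclose(AB.T, B.T @ A.T)      # (AB)' = B'A'

# Here BA is also defined (3 x 3), but it is a different matrix from AB:
# matrix multiplication does not commute even when both products exist.
assert (B @ A).shape != AB.shape
```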
18.3 Rank of a matrix
A row vector x is said to be linearly dependent on the rows of a matrix A if it is possible to find a row vector z such that x = zA. The same holds for a column vector.
r(A) (or rank(A)), the rank of a matrix A, is defined as the number of linearly independent rows or (the number is the same) the number of linearly independent columns of A.
A square matrix of size n is called non singular if r(A) = n.
If B is any n × k matrix, then r(AB) ≤ min(r(A), r(B)).
If B is an n × k matrix of rank n, then r(AB) = r(A).
If C is an l × n matrix of rank n, then r(CA) = r(A).
18.4 Some special matrices
1. A square matrix A with elements aij = 0 for i ≠ j is called a diagonal matrix.
2. A diagonal matrix with ones on the diagonal is called the identity and is indicated by I. IA = A and AI = A (if the product is defined).
3. A matrix which solves the equation AA = A is called idempotent.
18.5 Determinants and Inverse
There are several alternative definitions for the determinant of a square matrix. The Leibniz formula for the determinant of an n × n matrix A is

det(A) = |A| = Σσ sgn(σ) a1,σ1 a2,σ2 · · · an,σn

Here the sum is computed over all permutations σ of the set 1, 2, ..., n, and sgn(σ) denotes the signature of σ: it is +1 for even σ and −1 for odd σ. Evenness or oddness can be defined as follows: the permutation is even (respectively odd) if the new sequence can be obtained by an even (respectively odd) number of switches of elements of the set.
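The Leibniz formula can be implemented directly, if very inefficiently (it sums n! terms). The function below is my own brute-force sketch, checked against NumPy's determinant routine:

```python
# Brute-force Leibniz determinant (my sketch, not from the handout),
# checked against numpy.linalg.det. Only sensible for small n: n! terms.
from itertools import permutations
import numpy as np

def leibniz_det(A):
    n = A.shape[0]
    total = 0.0
    for perm in permutations(range(n)):
        # sign of the permutation: +1 if the number of inversions is even
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if perm[i] > perm[j])
        sign = -1.0 if inversions % 2 else 1.0
        prod = 1.0
        for i in range(n):
            prod *= A[i, perm[i]]        # a_{1,sigma(1)} ... a_{n,sigma(n)}
        total += sign * prod
    return total

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
assert abs(leibniz_det(A) - np.linalg.det(A)) < 1e-10
```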
The inverse of a square matrix A is the solution A−1 (or inv(A)) to the equations A−1 A = I = AA−1 .
If A is invertible then (A′ )−1 = (A−1 )′ .
The inverse of a square matrix A exists if and only if the matrix is non singular, that is, if the size and the rank of A are the same.
A square matrix is non singular if and only if it has non null determinant.
det(A−1 ) = 1/ det(A).
If the products and inversions in the following formula are well defined (that is, dimensions agree and the inverses exist), then (AB)−1 = B −1 A−1 .
Inversion has to do with the solution of linear non homogeneous systems.
Problem: find a column vector x such that Ax = b with A and b given.
If A is square and invertible then the unique solution is x = A−1 b.
If A is n × k with n > k but r(A) = k, then the system Ax = b has in general no exact solution; however, the system A′ Ax = A′ b has the solution x = (A′ A)−1 A′ b.
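This normal-equations solution can be verified with a small overdetermined system. The 3 × 2 matrix below is my own example; the point is that solving A′Ax = A′b agrees with NumPy's dedicated least squares routine:

```python
# Sketch (mine, not the handout's) of the overdetermined case: A is
# n x k with n > k and full column rank, so Ax = b has no exact solution
# in general, but the normal equations A'Ax = A'b do have one.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])               # 3 x 2, rank 2
b = np.array([1.0, 2.0, 2.0])

x = np.linalg.solve(A.T @ A, A.T @ b)    # x = (A'A)^{-1} A'b

# Same answer as NumPy's least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)
```

This x is exactly the least squares fit of b by the columns of A, the object studied in the Gauss-Markoff appendix above.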
18.6 Quadratic forms
A quadratic form with coefficient matrix given by the symmetric matrix A, and variables vector given by the column vector x (with the size of A equal to the number of rows of x), is the scalar given by:

x′ Ax = Σi Σj aij xi xj
A symmetric matrix A is called semi positive definite if and only if

x′ Ax ≥ 0 for all x

It is called positive definite if and only if

x′ Ax > 0 for all non null x
If a matrix A can be written as A = C ′ C for some matrix C, then A is surely at least psd. In fact x′ Ax = x′ C ′ Cx, but this is the product of the row vector x′ C ′ times itself, hence a sum of squares, and this cannot be negative. It is also possible to show that any psd matrix can be written as C ′ C for some C.
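The C′C argument can be checked numerically. The random matrix below is my own illustration, not part of the notes:

```python
# Numerical check (my addition) that A = C'C is positive semi-definite:
# x'Ax is the squared length of Cx, hence never negative.
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((4, 3))
A = C.T @ C                              # 3 x 3, symmetric, psd

for _ in range(100):
    x = rng.standard_normal(3)
    quad = x @ A @ x                     # x'Ax = (Cx)'(Cx) = ||Cx||^2
    assert quad >= -1e-12                # non-negative up to rounding
```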
18.7 Random Vectors and Matrices (see the following appendix for more details)
A random vector, resp matrix, is simply a vector (matrix) whose elements are random
variables.
18.8 Functions of Random Vectors (or Matrices)
• A function of a random vector (matrix) is simply a vector (or scalar) function of
the components of the random vector (matrix).
• Simple examples are: the sum of the elements of the vector, the determinant of
a random matrix, sums or products of matrices and vectors and so on.
• We shall be interested in functions of the vector (matrix) X of the kind: Y =
A + BXC where A, B and C are non stochastic matrices of dimensions such that
the sum and the products in the formula are well defined.
• A quadratic form x′ Ax with a non stochastic coefficient matrix A and stochastic vector x is an example of a non linear, scalar function of a random vector.
18.9 Expected Values of Random Vectors
• These are simply the vectors (matrices) containing the expected values of each
element in the random vector (matrix).
• E(X ′ ) = E(X)′
• An important result, which generalizes the linearity property of the scalar version of the operator E(.) to the general linear function defined above, is this: E(A + BXC) = A + BE(X)C.
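For a discrete random matrix this identity can be checked exactly. The two-outcome distribution and the matrices A, B, C below are my own example:

```python
# Exact check (my sketch, not from the handout) of E(A + BXC) = A + B E(X) C
# for a random matrix X taking two values with probabilities p and 1 - p.
import numpy as np

X1 = np.array([[1.0, 0.0], [0.0, 2.0]])
X2 = np.array([[3.0, 1.0], [1.0, 0.0]])
p = 0.3
EX = p * X1 + (1 - p) * X2               # elementwise expected value of X

A = np.ones((2, 2))
B = np.array([[2.0, 0.0], [1.0, 1.0]])
C = np.array([[0.0, 1.0], [1.0, 0.0]])

# E(A + BXC) computed outcome by outcome...
lhs = p * (A + B @ X1 @ C) + (1 - p) * (A + B @ X2 @ C)
# ...equals A + B E(X) C, since the map X -> A + BXC is affine.
assert np.allclose(lhs, A + B @ EX @ C)
```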
18.10 Variance Covariance Matrix
• For random column vectors (and here we mean vectors only) we define the variance covariance matrix of a column vector X as:
V (X) = V (X ′ ) = E(XX ′ ) − E(X)E(X ′ ) = E((X − E(X))(X − E(X))′ )
• The Varcov matrix is symmetric: on the diagonal we have the variances (V (Xi ) = σ²Xi ) of each element of the vector, while in the upper and lower triangles we have the covariances (Cov(Xi ; Xj )).
• The most relevant property of this operator is:
V (A + BX) = BV (X)B ′
• From this property we deduce that varcov matrices are always (semi) positive definite. In fact, if A = V (z) and x is a (non random) column vector of the same size as z, then V (x′ z) = x′ Ax, which cannot be negative, for any possible x.
18.11 Correlation Coefficient
• The correlation coefficient between two random variables is defined as:
$$\rho_{X_i;X_j} = \frac{Cov(X_i;X_j)}{\sigma_{X_i}\sigma_{X_j}}$$
The correlation matrix $\varrho(x)$ of the vector $x$ of random variables is simply the matrix of correlation coefficients or, which is the same, the Varcov matrix of the vector of standardized $X_i$.
• Zero correlation between two random variables is sometimes called linear independence or orthogonality. The reader should be careful using these terms as they also exist in the setting of linear algebra, where their meaning, even if connected, is slightly different. Stochastic independence implies zero correlation; the reverse proposition is not true.
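The identity between the correlation matrix and the varcov matrix of the standardized variables can be verified directly (a sketch assuming numpy; the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# simulated data: 3 variables, 50000 observations, with some correlation
X = rng.normal(size=(50_000, 3))
X[:, 1] += 0.8 * X[:, 0]

corr = np.corrcoef(X, rowvar=False)    # matrix of correlation coefficients

# the same matrix as the varcov matrix of the standardized variables
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr2 = np.cov(Z, rowvar=False, ddof=0)

print(np.abs(corr - corr2).max())      # zero up to floating point rounding
```

Note the `ddof=0` choice: standardizing with the population standard deviation and using the population covariance makes the two computations match exactly.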
18.12 Derivatives of linear functions and quadratic forms
Often we must compute derivatives of functions of the kind $x'Ax$ (a quadratic form) or $x'q$ (a linear combination of the elements of the vector $q$ with weights $x$) with respect to the vector $x$.
In both cases we are considering a (column) vector of derivatives of a scalar function w.r.t. a (column) vector of variables (commonly called a "gradient"). There is a useful matrix notation for such derivatives which, in these two cases, is simply given by:
$$\frac{\partial x'Ax}{\partial x} = 2Ax$$
and
$$\frac{\partial x'q}{\partial x} = q$$
The proof of these two formulas is quite simple. In both cases we give a proof for a generic element $k$ of the derivative column vector.
For the linear combination we have
$$x'q = \sum_j x_j q_j$$
$$\frac{\partial x'q}{\partial x_k} = q_k$$
For the quadratic form
$$x'Ax = \sum_i \sum_j x_i x_j a_{i,j}$$
$$\frac{\partial \sum_i \sum_j x_i x_j a_{i,j}}{\partial x_k} = \sum_{j \neq k} x_j a_{k,j} + \sum_{i \neq k} x_i a_{i,k} + 2x_k a_{k,k} = \sum_{j \neq k} x_j a_{k,j} + \sum_{j \neq k} x_j a_{k,j} + 2x_k a_{k,k} = 2A_{k,\cdot}\,x$$
where $A_{k,\cdot}$ means the $k$-th row of $A$ and we used the fact that $A$ is a symmetric matrix.
An important point to stress is that the derivative of a function with respect to a vector always has the same dimension as the vector w.r.t. which the derivative is taken, in this case $x$, so, for instance,
$$\frac{\partial x'Ax}{\partial x} = 2Ax$$
and not
$$\frac{\partial x'Ax}{\partial x} = 2x'A$$
(remember that $A$ is symmetric).
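Both gradient formulas can be verified with central finite differences (a sketch assuming numpy; A, q and x are random choices, with A symmetrized as the formulas require):

```python
import numpy as np

rng = np.random.default_rng(3)

M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                      # a symmetric coefficient matrix
q = rng.normal(size=4)
x = rng.normal(size=4)

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

g_quad = num_grad(lambda v: v @ A @ v, x)   # should equal 2 A x
g_lin = num_grad(lambda v: v @ q, x)        # should equal q

print(np.abs(g_quad - 2 * A @ x).max(), np.abs(g_lin - q).max())
```

For a quadratic function the central difference is exact up to rounding, so both gaps are tiny.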
18.13 Minimization of a PD quadratic form, approximate solution of over determined linear systems
Now let us go back to the linear system $Ax = b$ with $A$ an $n \times k$ matrix of rank $k$. If $n > k$ this system has, in general, no solution. However, let us try to solve a similar problem. By solving a system we wish for $Ax - b = 0$; in our case this is not possible, so let us change the problem to $\min_x (Ax - b)'(Ax - b)$. In words: try to minimize the sum of squared differences between $Ax$ and $b$ if you cannot make it equal to 0.
We have
$$(Ax - b)'(Ax - b) = x'A'Ax + b'b - 2b'Ax$$
Now let us take the derivative of this w.r.t. $x$:
$$\frac{\partial}{\partial x}\big(x'A'Ax + b'b - 2b'Ax\big) = 2A'Ax - 2A'b$$
(remember the rule about the size of a derivative vector). We now create a new linear system by equating these derivatives to 0:
$$A'Ax = A'b$$
And the solution is
$$x = (A'A)^{-1}A'b$$
This is the "least squares" approximate solution of an (over determined) linear system (see the Appendix on least squares and the Gauss Markov model).
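A numerical check of the formula (a sketch assuming numpy; the over determined system is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# over determined system: n = 20 equations, k = 3 unknowns, A of full rank
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

# least squares solution via the normal equations A'A x = A'b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# numpy's own least squares routine gives the same answer
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_ne, x_ls))

# no other x achieves a smaller sum of squared residuals
x_other = x_ne + rng.normal(size=3)
assert np.sum((A @ x_ne - b) ** 2) <= np.sum((A @ x_other - b) ** 2)
```

In production code `lstsq` (or a QR decomposition) is preferred to forming $A'A$ explicitly, for numerical stability; the normal equations are used here only to mirror the derivation.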
18.14 Minimization of a PD quadratic form under constraints. Simple applications to Finance
Suppose we are given a column vector $r$ where $r_j$ is the random (linear) return of stock $j$.
Suppose we are holding these stocks in a portfolio for one time period and that the (known) relative amount of each stock in our portfolio is given by the column vector $w$ such that $1'w = 1$, where 1 indicates a column vector of ones of the same size as $w$.
Then the random linear return of the portfolio over the same time period is given by $r_\pi = w'r$.
Since $w$ is known we have $E(w'r) = w'E(r)$ and $V(w'r) = w'V(r)w$.
The fact that, over one period of time, the expected linear return and the variance of the linear return of a portfolio depend only on the expected values and the covariance matrix of the single returns and on the weight vector is what allows us to implement a simple optimization method. For the moment let us suppose that the problem is
$$\min_{w:\,1'w=1} w'V(r)w$$
In this problem we want to minimize a quadratic form under a linear constraint.
It is to be noticed that, without the constraint, the problem would be solved by
w = 0 (no investment). The constraint does not allow for this.
Such problems can be solved with the Lagrange multiplier method.
The idea is to express in a single function both the need to minimize the original function and the need to do so while respecting the constraint $1'w = 1$.
In order to do this we define the Lagrangian of the problem, given by
$$L(w, \lambda) = w'V(r)w - 2\lambda(1'w - 1)$$
In this function the value of the unconstrained objective function is summed with the value of the constraint multiplied by a dummy parameter $2\lambda$.
We now take the derivatives of the Lagrangian w.r.t. $w$ and $\lambda$:
$$\frac{\partial}{\partial w}\big(w'V(r)w - 2\lambda(1'w - 1)\big) = 2V(r)w - 2\lambda 1$$
$$\frac{\partial}{\partial \lambda}\big(w'V(r)w - 2\lambda(1'w - 1)\big) = -2(1'w - 1)$$
If we set both these to zero we get, supposing $V(r)$ invertible,
$$V(r)w = \lambda 1$$
$$1'w = 1$$
Notice the difference between the 1-s. In the first equation 1 is a column vector, which is required because we cannot equate a vector to a scalar. The same holds for $1'$ in the second equation, while the r.h.s. is the scalar one (for dimension compatibility with the l.h.s.). We do not stress this by using, e.g., boldface for the vector 1 because the meaning follows unambiguously from the context.
It is clear that the second equation is satisfied if and only if $w$ satisfies the constraint.
What is the meaning of the first equation (or, better, set of equations)? The unconstrained equation would have been
$$V(r)w = 0$$
whose only solution (due to the fact that $V(r)$ is invertible) would be $w = 0$. But this solution does not satisfy the constraint. What we shall be able to get is $V(r)w = \lambda 1$, for some $\lambda$ chosen in such a way that the constraint is satisfied.
To find this $\lambda$, simply put together the result of the first set of equations, $w = \lambda V(r)^{-1}1$, and the equation expressing the constraint, $1'w = 1$. Both equations are satisfied if and only if
$$\lambda = \frac{1}{1'V(r)^{-1}1}$$
We now know $\lambda$, that is, we know exactly by how much we must violate the unconstrained optimization condition (first set of equations) in order to satisfy the constraint (second equation).
In the end, putting this value of $\lambda$ in the solution of the first set of equations, we get
$$w = \frac{V(r)^{-1}1}{1'V(r)^{-1}1}$$
It is to be noticed that these are only necessary conditions but, for our purposes, this is enough.
What we got is the one period "minimum variance portfolio" made of securities whose returns covariance is $V(r)$.
What is the variance of this portfolio?
$$V(w'r) = w'V(r)w = \frac{1'V(r)^{-1}}{1'V(r)^{-1}1}\,V(r)\,\frac{V(r)^{-1}1}{1'V(r)^{-1}1} = \frac{1'V(r)^{-1}1}{(1'V(r)^{-1}1)^2} = \frac{1}{1'V(r)^{-1}1}$$
The expected value shall be
$$E(w'r) = w'E(r) = \frac{1'V(r)^{-1}E(r)}{1'V(r)^{-1}1}$$
If $V(r)$ is only psd, then it shall not be invertible, so that the system $V(r)w = \lambda 1$ cannot be solved by simple inversion. In this case, however, there shall exist non null vectors $w^*$ such that $w^{*\prime}V(r)w^* = 0$ and, using such $w^*$, it shall be possible to build portfolios of the securities with (linear) return vector $r$, and maybe the risk free, such that the return of such portfolios is risk free (zero variance) even if its components are risky. Such riskless return must be equal to the risk free rate for no arbitrage to hold.
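The minimum variance weights are straightforward to compute (a sketch assuming numpy; the covariance matrix is invented, with a small ridge added to keep it safely invertible):

```python
import numpy as np

rng = np.random.default_rng(5)

# an invertible covariance matrix for 4 returns, built as C'C plus a ridge
C = rng.normal(size=(4, 4))
V = C.T @ C + 0.1 * np.eye(4)

ones = np.ones(4)
Vinv1 = np.linalg.solve(V, ones)       # V^{-1} 1, without forming the inverse

w = Vinv1 / (ones @ Vinv1)             # minimum variance portfolio weights
print(w, w.sum())                      # the weights satisfy 1'w = 1

# its variance is 1 / (1' V^{-1} 1); no other feasible w does better
var_w = w @ V @ w
assert np.isclose(var_w, 1 / (ones @ Vinv1))
w2 = rng.dirichlet(np.ones(4))         # any other weight vector summing to 1
assert var_w <= w2 @ V @ w2 + 1e-12
```

Note that the formula does not forbid negative weights: the minimum variance portfolio may involve short positions.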
18.15 The linear model in matrix notation
Suppose you have a matrix $X$ of dimensions $n \times k$ containing $n$ observations on each of $k$ variables. You also have an $n \times 1$ vector $y$ containing $n$ observations on another variable.
You would like to approximate $y$ with a linear function of $X$, that is: $Xb$ for some $k \times 1$ vector $b$.
In general, if $n > k$, it shall not be possible to exactly fit $Xb$ to $y$, so that the approximation shall imply a vector of errors $\epsilon = y - Xb$.
You would like to minimize $\epsilon$, but this is a vector: we must define some scalar function of it which we wish to minimize.
A possible choice is $\epsilon'\epsilon$, that is: the sum of squares of the errors.
We then wish to minimize
$$\epsilon'\epsilon = (y - Xb)'(y - Xb) = y'y + b'X'Xb - 2y'Xb$$
If we take the derivative of this w.r.t. $b$ we get
$$\frac{\partial}{\partial b}\big(y'y + b'X'Xb - 2y'Xb\big) = 2X'Xb - 2X'y$$
(again remember the size rule and remember that $y'Xb = b'X'y$: each is the transpose of the other but both are scalars).
The solution of this is
$$b = (X'X)^{-1}X'y$$
This simple application of the rule for the approximate solution of an over determined
system yields the most famous formula in applied (multivariate) Statistics. When this
problem, for the moment just a best fit problem, shall be immersed in the appropriate
statistical setting, our b shall become the Ordinary Least Squares parameter vector
and shall be of paramount relevance in a wide range of applications to Economics and
Finance.
19 Appendix: What you cannot ignore about Probability and Statistics
A quick check
The following simple example deals with the relations and differences between probability concepts and statistical concepts.
Let us start from two simple concepts: the mean and the expected value.
You know that an expected value has to do with a probability model: you cannot
compute it if you do not know the possible values of a random variable and their
probabilities.
On the other hand an average or mean is a simpler concept involving just a set of
numerical values: you take the sum of the values and divide by their number.
Sometimes, if certain assumptions hold (e.g. iid data), an expected value can be
estimated using a mean computed over a given dataset.
Moreover, when a mean is seen not as an actual number (the sum of actually observed data divided by the number of summands) but as a sum of still unobserved, hence random, data divided by their number, the mean becomes a random quantity: being a function of random variables, it enters the field of Probability and needs a probability model for its description. As such, it is reasonable to ask for its probability distribution, expected value and variance. In fact, this is the study of the "sampling variability" of an estimate. At the opposite, probability distribution, expected value and variance have no interesting meaning for the mean of a given set of numbers, which has one and only one possible value.
This dualism between quantities computed on numbers and functions of random
variables is true for all other statistical quantities.
In the case of the mean/average vs the expected value, we use (but not always) different names to stress the different roles of the objects we speak of. The same is done (usually) when we distinguish "frequency" and "probability". To apply this to each statistical concept would be a little cumbersome and, in fact, is not done in most cases. A variance is called a variance both when used in the probability setting and as a computation on numbers; the same goes for moments, covariances etc. Even the word "mean" is often used to indicate both expected values and averages. This is a useful shortcut but should not trick us into believing that the use of the same name implies identity of properties. Care must be used.
In the experience of any teacher of Statistics, the potential misunderstandings which can derive from an incomplete understanding of this basic point are at the origin of most of the problems students incur when confronted with statistical concepts.
Consider the following example and, even if you judge it trivial, dedicate some time
to really repeat and understand all its steps.
Suppose you observe the numbers 1,0,1. The mean of these is, obviously, 2/3. Is it
meaningful to ask questions about the expected value of each of these three numbers
or of the mean? Not at all, except in the very trivial case where the answer to this
question coincides with the actual observed numbers.
However, in most relevant cases the numbers we may observe are not predetermined. They are obviously known after we observe the data, but it is usually the case that we also want to say something about their values in future possible observations (e.g. we must decide about taking some line of action whose actual result depends on the future values of observables; this is the basic setting in financial investments). We cannot do this without the proper language: we need a model, written in the language of Probability, able to describe the "future possible observations".
For instance, we could think it sensible to assume that each single number we observe can only be either 0 or 1, that each possible observation has the same probability distribution for the possible results ($P$ for 1 and $1 - P$ for 0) and that observations are independent, that is: the probability of each possible string of results is nothing but the product of the probabilities of each result in the string.
Since we only know that $P$ is a number between 0 and 1, the mean computed above using data from the phenomenon so modeled (in this case equivalent to the "relative frequency" of 1) has a new role: it could be useful as an "estimate" of $P$.
Under our hypotheses, however, it is clear that the value 2/3 is only the value of
our mean for the observed data, it is NOT the value of P which is still an unknown
constant. We need something connecting the two.
The first step is to consider the possible values that the mean could have had on
other possible “samples” of three observations.
By enumeration these are 0, 1/3, 2/3, 1.
We can also compute, under our hypotheses, the probabilities of these values. Since a mean of 1 can happen only when we observe three ones, and since the three results are independent, each with the same probability $P$, the probability of observing a mean of 1 is $P \cdot P \cdot P = P^3$. On the other hand, a mean of 0 can only be observed when we observe only zeroes, that is, with probability $(1-P)^3$. A mean of 1/3 is obtained when we observe one 1 and two 0s. There are three possibilities for this: 1,0,0; 0,1,0 and 0,0,1. The respective probabilities (under our hypotheses) are $P(1-P)(1-P)$, $(1-P)P(1-P)$ and $(1-P)(1-P)P$. The three possibilities exclude each other, so we can sum the probabilities: in the end we have $3P(1-P)^2$. The same reasoning gives the probability of observing a mean of 2/3, that is: $3P^2(1-P)$.
What we just did is to compute the “sampling distribution” of the mean seen as an
“operator” which we can apply to any possible sample.
This sampling distribution of the mean gives us all its (four) possible values (on
n = 3 samples) and their probabilities as functions of P .
Since we now have both the possible values and their probabilities we can compute
the expected value and the variance of this mean. This is the second step to take in
order to connect the estimate to the “parameter” P .
These computations shall give us information about how good an “estimate” the
mean can be of the unknown $P$. We would like the mean to have expected value $P$ (unbiasedness) and as small a variance as possible, so as to be "with high probability" "near" to the true but unknown value of $P$. What this last sentence means is, simply, that the probability of observing samples where the mean has a value near $P$ should be big.
Formally an expected value is very similar to a mean with the difference that each
value of the (now) random variable is multiplied by its probability and not its frequency.
The expected value shall be:
$$0 \cdot (1-P)^3 + \frac{1}{3} \cdot 3P(1-P)^2 + \frac{2}{3} \cdot 3P^2(1-P) + 1 \cdot P^3 = P$$
Notice the difference with respect to a mean computed on a given sample and be
careful not to mistake the point. The difference is not that the result is not a number
but an unknown “parameter”. It could well be that P is known and, say, equal to 1/3 so
that the result would be a number. The difference is that this quantity, the expected
value of the sample mean, is a probability quantity, has nothing to do with actual
observations and frequencies and has everything to do with potential observations and
probability. In fact, on each given sample we have a given value of the mean so its
expected value has a meaning only because we consider the variability of the values
of this mean on the POSSIBLE samples. However, the result is very useful both if P
is unknown and if it is known. When P is unknown it tells us that the mean of the
observed data shall be unbiased as an estimate for P . When P is known to be, say,
1/3 it shall give P an “empirical connection” to an observable quantity, by assessing
that the expected value of the mean of the observed data shall be 1/3.
The question, however, is: OK, this for the expected value of the sample mean. But how probable is it that the actual observed mean is "near" $P$?
Well, suppose for instance $P = 1/3$: we immediately see that the probability of observing a mean equal to 1/3 is 4/9, while the probability of observing a mean between 0 and 2/3 is $1 - 1/27$, very near to 1. However, this computation is quite cumbersome and difficult to make for samples bigger than our 3 observations.
A more useful answer to this question requires the computation of the (sampling) variance of our mean. By using the definition of variance and the probabilities already computed we get, for the variance of the mean:
$$P(1-P)/3$$
The general case, for a sample of size $n$ instead of 3, shall be:
$$P(1-P)/n$$
Clearly the bigger n the smaller the variance.
This, again, is an unknown number when $P$ is unknown. However, we can say much about its value. In fact, since $P$ is between 0 and 1, $P(1-P)$ has a maximum value of 1/4 (attained exactly when $P = 1/2$).
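The whole n = 3 example can be reproduced by brute force enumeration (a sketch in pure Python, using exact rational arithmetic; P is set to 1/3 as in the example, but any value works):

```python
from fractions import Fraction
from itertools import product

P = Fraction(1, 3)                 # the parameter, fixed here for the check

# enumerate all 2^3 samples with their probabilities and sample means
dist = {}
for sample in product([0, 1], repeat=3):
    prob = Fraction(1)
    for x in sample:
        prob *= P if x == 1 else 1 - P
    mean = Fraction(sum(sample), 3)
    dist[mean] = dist.get(mean, 0) + prob

print(dist)                        # the four values 0, 1/3, 2/3, 1

# E(mean) = P (unbiasedness) and V(mean) = P(1 - P)/3
e = sum(v * p for v, p in dist.items())
var = sum((v - e) ** 2 * p for v, p in dist.items())
assert e == P and var == P * (1 - P) / 3
```

Because the arithmetic is exact, the two assertions verify unbiasedness and the variance formula with no simulation error at all.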
How is this connected with the probability of observing a mean “near” to P ?
The answer is given by the Tchebicev inequality. This says that for any random variable $X$ (hence also for the sample mean) we have:
$$Prob\big(E(X) - k\sqrt{V(X)} < X < E(X) + k\sqrt{V(X)}\big) \geq 1 - 1/k^2$$
(for any positive $k$). This implies, for instance, that if $P = 1/3$ and $n = 3$, there is at least a probability of .75 that the sample mean be observed between the values $1/3 - .5443$ and $1/3 + .5443$ (with $k = 2$, $k\sqrt{V} = 2\sqrt{2/27} \approx .5443$).
This is already very useful information, but think of what happens when the sample size is not 3 but, say, 100. In this case the above interval becomes much narrower: $1/3 - .0943$ to $1/3 + .0943$. Even if $P$ is unknown, with $n = 100$ the interval for at least a probability of .75 shall never be wider than $\pm .1$ ($= \pm 2\sqrt{1/(4 \cdot 100)}$) around the "true" $P$. Results like the central limit theorem allow us to be even more precise, but this is outside the scope of this exercise.
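A small simulation makes the Tchebicev bound concrete (a pure Python sketch; since the bound is conservative, the observed coverage is well above the guaranteed .75):

```python
import math
import random

random.seed(0)

P, n, k = 1 / 3, 100, 2
half_width = k * math.sqrt(P * (1 - P) / n)   # k * sqrt(V(sample mean))

# draw many samples of size n and count how often the sample mean falls
# inside E(mean) +/- k sqrt(V(mean)); Tchebicev guarantees at least 1 - 1/k^2
trials = 20_000
hits = 0
for _ in range(trials):
    mean = sum(random.random() < P for _ in range(n)) / n
    if abs(mean - P) < half_width:
        hits += 1

print(hits / trials)               # well above the Tchebicev bound of .75
```

The gap between the simulated coverage and .75 is exactly the extra precision that results like the central limit theorem recover.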
Beyond the numbers, what this boils down to is that, by studying the sample mean as a random variable (random due to the fact that the sample values are random before observing them), we are able to connect a parameter in the model, $P$, to an observable quantity, the sample mean. Conversely, we also understand the empirical role of $P$ in determining the probability of different possible observations of the sample mean: a bigger $P$ implies a higher probability of observing a bigger mean.
These are simple instances of two basic points in Statistics as applied to any science: we call the first "(statistical) inference" (transforming information on observed frequencies into information on probabilities) and the second "forecasting" (assessing the probabilities of observations still to be made).
This longish example is not intended to teach you any new concept: with the possible exception of the Tchebicev inequality, all this should already be known after a BA in Economics.
You can take it as follows: if all the concepts and steps in the example seem clear, even trivial, fine! All that follows in this course shall be quite easy.
On the other hand, if any step seems fuzzy or inconsequential, dedicate some more
time to a quick rehearsal of what you already did during the BA concerning Probability
and Statistics.
And for any problem ask your teachers.
How should you use what follows?
In the following paragraphs you shall find a quick summary of basic Probability and
Statistics concepts.
A good understanding of basic concepts in modern Finance and Economics as applied to the fields of asset pricing, asset management, risk management and Corporate Finance (that is: what you do in the two year master) would require a full knowledge of what follows. As far as this course is concerned, a good understanding of the strictly required statistical and Probability concepts (really basic, in fact!) can be derived by simply examining the questions asked in past exams. Moreover, before section 1 and section 6 of these handouts you can find a short list of concepts that are essential to the understanding of the first and the second part of these handouts.
In what follows a small number of less essential points, preceded by an asterisk, can
be left out.
The following summary is (obviously) NOT an attempt to write an introductory text of Probability and Statistics. It should be used as a quick summary check: browse through it and check whether most of the concepts are familiar.
In the unlikely case they are not (this could be the case for students coming from different-field BAs), you should dedicate some time to upgrading your basic notions of Probability and Statistics.
For any problem and suggestion ask your teachers.
Probability
19.1 Probability: a Language
• Probability is a language for building decision models.
• As all languages, it does not offer or guarantee ready made splendid works of art (that is: right decisions) but simply a grammar and a syntax whose purpose is avoiding inconsistencies. We call this grammar and this syntax "Probability calculus".
• On the other hand, any language makes it simple to "say" something, difficult to say something else, and there are concepts that cannot even be thought in any given language. So, no analysis of what we write in a language is independent of the structure of the language itself. And this is true for Probability too.
• The language is useful to deduce probabilities of certain events when other probabilities are given, but the language itself tells us nothing about how to choose
such probabilities.
19.2 Interpretations of Probability
• A lot of (often quite cheap) philosophy on the empirical meaning of probability
boils down to two very weak suggestions:
• For results of replicable experiments, it may be that probability assessments have
to do with long run (meaning what?) frequency;
• For more general uncertainty situations, probability assessments may have something to do with prices paid for bets, provided you are not directly involved in
the result of the bet, except with regard to a very small sum of money.
• In simple situations, where some symmetry statement is possible, as in the standard setting of “games of chance” where probability as a concept was born, the
probability of relevant events can be reduced to some sum of probabilities of
“elementary events” you may accept as “equiprobable”.
19.3 Probability and Randomness
• Probability is, at least in its classical applications, introduced when we wish to model a collective "random" phenomenon, that is, an instance where we agree that something is happening "under constant conditions" and, this notwithstanding, the result is not fully determined by these conditions and is, a priori, unknown to us.
• Traders are interested in returns from securities, actuaries in mortality rates,
physicists in describing gases or subatomic particles, gamblers in assessing the
outcomes of a given gamble.
• At different degrees of confidence, students in these fields would admit that, in principle, it could be possible to attempt a specific modeling of each instance of the phenomena they observe but that, in practice, such a model would require such an impossible precision in the measurement of initial conditions and parameters as to be useless. Moreover, computations for solving such models would be unwieldy even in simple cases.
• For these reasons students in these fields are satisfied with a theory that avoids a
case by case description, but directly models possible frequency distributions for
collectives of observations and uses the probability language for these models.
19.4 Different Fields: Physics
• Quantum Physics seems the only field where the "in principle" clause is usually not considered valid.
• In Statistical Physics a similar attitude is held, but for a different reason. Statistical Physics describes pressure as the result of "random" hits of gas molecules on the surface of a container. In doing this it refrains from using the standard arguments of single particle mechanics, not because this would be in principle impossible but because the resulting model would be in practice useless (for instance, its solution would depend on a precise measurement of the position and momentum of each gas molecule, something impossible to accomplish in practice).
19.5 Finance
• Finance people would admit that days are by no means the same and that prices are not due to "luck" but to a very complex interplay of news, opinions, sentiments etc. However, they admit that modeling this with useful precision is impossible and that, at a first level of approximation, days can be seen as similar, so that it is interesting to be able to "forecast" the frequency distribution of returns over a sizable set of days.
• The attitude is similar to that of Statistical Physics where, however, hypotheses of homogeneity of the underlying micro behaviours are easier to sustain. Moreover, while we could model in an exact way a few particles, we cannot do the same even with a single human agent.
19.6 Other fields
• Actuaries do not try to forecast with ad hoc models the lifespan of this or that insured person (though they condition their models on some relevant characteristics like age, sex, smoker/non smoker and so on): they are satisfied with a (conditional) modeling of the distribution of lifespan in a big population and with matching this to their insured population.
• Gamblers compute probabilities, and sometimes collect frequencies. They would like to be able to forecast each single result, but their problem, when the result depends on some physical randomizing device (roulette, die, coin, shuffled deck of cards etc.), is exactly the same as the physicist's problem, at least when the gamble result depends on the physics of the randomizing device.
• Very different and much more interesting is the case of betting (horse racing, football matches, political elections etc.). In this case the repeatability of events under similar conditions cannot be invoked as a justification for the use of probabilities, and this implies a different and interesting interpretation of probability which is beyond the scope of this summary.
• Weather forecasters, as all sensible forecasters (as opposed to foretellers), phrase their statements in terms of probabilities of basic events (sunny day, rain, thunderstorms, floods, snow, etc.). In countries where this is routinely done and weather forecasts are actually made in terms of probabilities (as in the UK and USA but not frequently in Italy), over time the meaning of, say, "60% probability of rain" and the usefulness of the concept have come to be understood by the general public (probability is not and should not be a mathematicians-only concept).
• Risk managers in any field (the financial one is a very recent example) aim at controlling the probability of adverse events.
• Any big general store chain must determine the procedures for replenishing the
inventories given a randomly varying demand. This problem is routinely solved
by probability models.
• A similar problem (with similar solutions) is encountered when the (random) demand and the (less random) supply of energy must be matched in a given energy grid; when channels must be allocated in a communication network; when turnpikes must be opened or closed to control traffic, etc.
• These are just examples of the applied fields where probability models and Statistics are applied with success to the solution of practical problems of paramount
relevance.
19.7 Wrong Models
• As we have already seen, in a sense all probability models are "wrong". With the exception (perhaps) of Quantum Mechanics, they do not describe the behaviour of each observable instance of a phenomenon but try, with the use of the non empirical concept of probability, to directly and at the same time fuzzily describe aggregate results: collective events.
• For this simple reason they are useful inasmuch the decision payout depends, in
some sense, on collectives of events.
• They are not useful for predicting the result of the next coin toss but they are
useful for describing coin tossING.
19.8 Meaning of Correct
• A good use of probability only guarantees the model to be self consistent. It cannot guarantee it to be successful.
• When the term "correct" is applied to a probability model (it would be better to call it "satisfactory"), what is usually meant is that its probability statements are well matched by empirical frequencies (the term "well calibrated" is also used in this sense).
• Sometimes, probability models are used in cases where the relevant event shall happen only once or a few times.
• In this case the model shall be useful more for organizing our decision process than
for describing its outcome. “Correct” in this case means: “a good and consistent
summary of our opinions”.
19.9 Events and Sets
• Probabilities are stated for "events", which are propositions concerning facts whose truth value can reasonably be assessed at a given future time. However, formally, probabilities are numbers associated with sets of points.
• Points represent "atomic" verifiable propositions which, at least for the purposes of the analysis at hand, shall not be derived from simpler verifiable propositions.
• Sets of such points simply represent propositions which are true any time any one of the (atomic) propositions within each set is true.
• Notice that, while points must always be defined, it may well be the case that we only deal with sets of them, and that some or all of these points, while elements of these sets, are not considered as sets by themselves. For instance, in rolling a standard die we have 6 possible "atomic" results, but we could be interested only in the probability of non atomic events like "the result is an even number" or "the result is bigger than 3". Since probabilities shall be assigned to a chosen class of sets of points, and we shall call these sets "events", it may well be that these "events" do not include atomic propositions (which in common language would graduate to the name "event").
• Sets of points are indicated by capital letters: $A, B, C, \ldots$. The "universe" set (representing the sure event) is indicated with $\Omega$ and the empty set (the impossible event) with $\emptyset$ (read: "ou").
• Finite or enumerably infinite collections of sets are usually indicated with $\{A_i\}_{i=1}^{n}$ and with $\{A_i\}_{i=1}^{\infty}$.
• Correct use of basic Probability requires the knowledge of the basic set theoretical operations: $A \cap B$ (intersection), $A \cup B$ (union), $A \setminus B$ (difference), $\bar{A}$ (negation) and their basic properties. The same is true for finite and enumerably infinite unions and intersections: $\cup_{i=1}^{n} A_i$, $\cup_{i=1}^{\infty} A_i$ and so on.

19.10 Classes of Events
• Probabilities are assigned to events (sets) in classes of events which are usually assumed closed with regard to some set operations.
• The basic class is an Algebra, usually indicated with an uppercase calligraphic letter: $\mathcal{A}$. An algebra is a class of sets which includes $\Omega$ and is closed under finite intersection and negation of its elements, that is: if two sets are in the class, their intersection and negations are also in the class. This implies that the finite union is also in the class, and so is the symmetric difference (why?).
• When the class of sets contains more than a finite number of sets, usually enumerably infinite unions of sets in the class are also required to be sets in the class itself (and so enumerable intersections, why?). In this case the class is called a $\sigma$-algebra. The name "Event" is from now on used to indicate a set in an algebra or $\sigma$-algebra.
19.11
Probability as a Set Function
• A probability is a set function P defined on the elements of an algebra such that: P(Ω) = 1, P(Ā) = 1 − P(A) and, for any finite number of disjoint events {A_i}_{i=1}^n (A_i ∩ A_j = ∅ ∀i ≠ j), we have: P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).
• If the probability is defined on a σ-algebra we require the above additivity property to be valid also for enumerable unions of disjoint events.
19.12
Basic Results
• A basic result, implied by the above axioms, is that for any pair of events we have: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• Another basic result is that if we have a collection of disjoint events {A_i}_{i=1}^n (A_i ∩ A_j = ∅ ∀i ≠ j) and another event B such that B = ∪_{i=1}^n (A_i ∩ B), then we can write: P(B) = Σ_{i=1}^n P(B ∩ A_i).
19.13
Conditional Probability
• For any pair of events we may define the conditional probability of one to the
other, say: P (A|B) as a solution to the equation P (A|B)P (B) = P (A ∩ B).
• If we require, as we usually do, the conditioning event to have positive probability, P(B) ≠ 0, this solution is unique and we have: P(A|B) = P(A ∩ B)/P(B).
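As a minimal numerical illustration of this definition (the events chosen, “even result” and “result bigger than 3” on a fair die, are just an example):

```python
# Illustrative sketch of conditional probability on a fair six-sided die.
# The events A ("even result") and B ("result bigger than 3") are an
# arbitrary choice for the example.
from fractions import Fraction

omega = range(1, 7)                   # the six equiprobable atomic results
A = {x for x in omega if x % 2 == 0}  # {2, 4, 6}
B = {x for x in omega if x > 3}       # {4, 5, 6}

def prob(event):
    # with equiprobable atoms, P(E) = |E| / |Omega|
    return Fraction(len(event), 6)

p_A_given_B = prob(A & B) / prob(B)   # P(A|B) = P(A ∩ B) / P(B)
print(p_A_given_B)                    # 2/3
```

Here P(A ∩ B) = 2/6 and P(B) = 3/6, so the conditional probability is 2/3.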
19.14
Bayes Theorem
Using the definition of conditional probability and the above two results we can prove
Bayes Theorem.
Let {A_i}_{i=1}^n be a partition of Ω in events, that is: A_i ∩ A_j = ∅ ∀i ≠ j and ∪_{i=1}^n A_i = Ω. We have:

P(A_i|B) = P(B|A_i)P(A_i) / Σ_{j=1}^n P(B|A_j)P(A_j)
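The theorem is easy to check numerically. In the following sketch the priors P(A_i) and the likelihoods P(B|A_i) are invented numbers for a three-event partition:

```python
# Sketch of Bayes theorem for a partition of three events.
# The priors P(A_i) and likelihoods P(B|A_i) are invented numbers.
priors = [0.5, 0.3, 0.2]        # P(A_i): must sum to 1
likelihoods = [0.1, 0.4, 0.8]   # P(B|A_i)

# denominator: P(B) computed by the total probability result above
evidence = sum(l * p for l, p in zip(likelihoods, priors))
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]

print(posteriors)  # the posteriors P(A_i|B) sum to 1
```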
19.15
Stochastic Independence
• We say that two events A and B are “independent in the probability sense”, “stochastically independent” or, simply, when no misunderstanding is possible, “independent” if P(A ∩ B) = P(A)P(B).
• If we recall the definition of conditional probability, we see that, in this case, the conditional probability of each event given the other is again the “marginal” probability of that event.
19.16
Random Variables
• These are functions X(.) from Ω to the real axis R.
• Not all such functions are considered random variables. For X(.) to be a random
variable we require that for any real number t the set Bt given by the points ω
in Ω such that X(ω) ≤ t is also an event, that is: an element of the algebra (or
σ-algebra).
• The reason for this requirement (whose technical name is: “measurability”) is that a basic tool for modeling the probability of values of X is the “probability distribution function” (PDF) (sometimes “cumulative distribution function”, CDF) of X, defined for all real numbers t as: F_X(t) = P({ω : X(ω) ≤ t}) = P(B_t). Obviously, in order for this definition to have a meaning, we need all the B_t to be events (that is: a probability P(B_t) must be assessed for each of them).
19.17
Properties of the PDF
• From its definition we can deduce some noticeable properties of F_X:
1. it is a non decreasing function;
2. its limit for t going to −∞ is 0 and its limit for t going to +∞ is 1;
3. we have lim_{h↓0} F_X(t + h) = F_X(t), but this is in general not true for h ↑ 0, so that the function may be discontinuous (it is right continuous).
• We may have at most an enumerable set of such discontinuities (as they are discontinuities of the first kind).
• Each of these discontinuities is to be understood as a probability mass concentrated on the value t where the discontinuity appears. Elsewhere F is continuous.
19.18
Density and Probability Function
• In order to specify probability models for random variables we usually do not directly specify F but other functions easier to manipulate.
• We usually consider two cases as most relevant (though interesting mixtures of the two may appear):
1. the absolutely continuous case, that is: where F shows no discontinuity and can
be differentiated with the possible exception of a set of isolated points
2. the discrete case where F only increases by jumps.
19.19
Density
In the absolutely continuous case we define the probability density function of X as: f_X(t) = ∂F_X(s)/∂s |_{s=t} where this derivative exists, and we complete this function in an arbitrary way where it does not. Any choice of completion shall have the property: F_X(t) = ∫_{−∞}^{t} f_X(s) ds.
19.20
Probability Function
In the discrete case we call “support” of X the at most enumerable set of values x_i corresponding to discontinuities of F, and we indicate this set with Supp(X). We define the probability function P_X(x_i) = F_X(x_i) − lim_{h↑0} F_X(x_i + h) for all x_i ∈ Supp(X), with the agreement that such a function is zero on all other real numbers. In simpler but less precise words: P_X(.) is equal to the “jump” of F(.) at each point x_i where a jump happens, and zero everywhere else.
19.21
Expected Value
The “expected value” of (in general) a function G(X) is then defined, in the continuous
and discrete case as
E(G) = ∫_{−∞}^{+∞} G(s) f_X(s) ds

and

E(G) = Σ_{x_i ∈ Supp(X)} G(x_i) P_X(x_i)
If G is the identity function G(t) = t the expected value of G is simply called the
“expected value”, “mathematical expectation”, “mean”, “average” of X.
19.22
Expected Value
• If G is a non-negative integer power, G(X) = X^k, we speak of “the k-th moment of X” and usually indicate this with m_k or µ_k.
• If G(X) is the function I(X ∈ A) for a given set A, which is equal to 1 if X ∈ A and 0 otherwise (the indicator function of A), then E(G(X)) = P(X ∈ A).
• In general, when the probability distribution of X is NOT degenerate (concentrated on a single value x), E(G(X)) ≠ G(E(X)). There is a noticeable exception: if G(X) = aX + b with a and b constants. In this case we have E(aX + b) = aE(X) + b.
• Sometimes the expected value of X is indicated with µX or simply µ.
19.23
Variance
• The “variance” of G(X) is defined as V(G(X)) = E((G(X) − E(G(X)))²) = E(G(X)²) − E(G(X))².
• A noticeable property of the variance is that V(aG(X) + b) = a²V(G(X)).
• The square root of the variance is called “standard deviation”. For these two quantities the symbols σ² and σ are often used (with or without the name of the variable as a subscript).
19.24
Tchebicev Inequality
• A fundamental inequality which connects probabilities with means and variances
is the so called “Tchebicev inequality”:
P(|X − E(X)| < λσ) ≥ 1 − 1/λ²
• As an example: if λ is set to 2, the inequality gives a probability of at least 75% for X to be between its expected value plus and minus 2 times its standard deviation.
• Since the inequality is sharp, that is: it is possible to find a distribution for which the inequality becomes an equality, this implies that, for instance, 99% probability could require a ± “10 σ” interval.
• For comparison, 99% of the probability of a Gaussian distribution is contained
in the interval µ ± 2.576σ.
• These simple points have a great relevance when tail probabilities are computed
in risk management applications.
• In popular literature about extreme risks, and also in some applied work it is
common to ask for a “six sigma” interval. For such an interval the Tchebicev
bound is 97.(2)%
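The bound is easy to verify by simulation. The sketch below draws from an exponential distribution (an arbitrary choice, with mean and standard deviation both equal to 1) and checks the λ = 2 case discussed above:

```python
# Simulation check of the Tchebicev bound for lambda = 2.
# The exponential(1) distribution (mean 1, standard deviation 1) is an
# arbitrary choice for the experiment.
import random

random.seed(0)
n = 100_000
mu, sigma = 1.0, 1.0
sample = [random.expovariate(1.0) for _ in range(n)]

lam = 2.0
inside = sum(abs(x - mu) < lam * sigma for x in sample) / n
bound = 1 - 1 / lam**2   # Tchebicev guarantees at least 0.75

print(inside, bound)     # the empirical frequency must exceed the bound
```

For this distribution the empirical frequency is near 95%, well above the 75% guaranteed by the inequality: the bound is valid for every distribution, hence often far from tight for any particular one.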
19.25
*Vysochanskij–Petunin Inequality
Tchebicev inequality can be refined by the Vysochanskij–Petunin inequality which, with the added hypothesis that the distribution be unimodal, states that, for any λ > √(8/3) = 1.632993:

P(|X − µ| < λσ) ≥ 1 − 4/(9λ²)

more than halving the probability outside the given interval given by Tchebicev: the 75% for λ = 2 becomes now 1 − 1/9, that is 88.(9)%.
Obviously, this gain in precision is large only if λ is not too big. The fabled “six sigma” interval according to this inequality contains at least 98.76% probability, just about 1.5% more than Tchebicev.
19.26
*Gauss Inequality
This result is an extension of a result by Gauss, who stated that if m is the mode (mind: not the expected value; in this lies the V-P extension) of a unimodal random variable, then

P(|X − m| < λτ) ≥ 1 − 4/(9λ²)   if λ ≥ 2/√3
P(|X − m| < λτ) ≥ λ/√3          if 0 ≤ λ ≤ 2/√3

where τ² = E((X − m)²).
19.27
*Cantelli One Sided Inequality
A less well known but useful inequality is the Cantelli or one-sided Tchebicev inequality which, phrased in a way useful for left-tail-sensitive risk managers, becomes, for λ < 0:

P(X − µ ≥ λσ) ≥ λ²/(1 + λ²)

For λ = −2 this means that at least 4/5 of the probability (80%) is above the µ − 2σ lower boundary.
For “minus six sigma” this becomes 36/37 = 97.297297...%.
19.28
Quantiles
• The “α-quantile” of X is defined as a value qα such that the following conditions are simultaneously valid:

Pr[X < qα] ≤ α
Pr[X ≤ qα] ≥ α

• Notice that in the case of a random variable with continuous F_X(.) this definition could be written as qα ≡ inf(t : F_X(t) = α), and in the case of a continuous strictly increasing F_X(.) this becomes qα ≡ t : F_X(t) = α.
• For a non continuous F_X(.), in case α is NOT one of the values taken by F_X(.), the above definition corresponds to a value x of X with F_X(x) > α.
• Due to applications in the definition of VaR it is more proper to use as quantile, in this case, a qα defined as the maximum value x of X with positive probability and with F_X(x) ≤ α.
• The formal definition of this is rather cumbersome:

qα ≡ max {x : [Pr[X ≤ x] > Pr[X < x]] ∩ [Pr[X ≤ x] ≤ α]}

• This reads: “qα is the greatest value x of X such that it has a positive probability and such that F_X(x) ≤ α”.
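A minimal sketch of the first definition for a discrete distribution (the support and the probabilities are invented for the example): the α-quantile is found as the smallest support point x with F_X(x) ≥ α, which satisfies both conditions above.

```python
# Sketch of the quantile definition for a discrete distribution.
# Support and probabilities are invented for the example.
support = [1, 2, 3, 4]
probs = [0.2, 0.3, 0.4, 0.1]

def quantile(alpha):
    # smallest support point x with F_X(x) = P(X <= x) >= alpha;
    # this satisfies both Pr[X < q] <= alpha and Pr[X <= q] >= alpha
    cum = 0.0
    for x, p in zip(support, probs):
        cum += p
        if cum >= alpha:
            return x
    return support[-1]

print(quantile(0.5))  # F(1) = 0.2 < 0.5 while F(2) = 0.5, so the answer is 2
```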
19.29
Median
• If α = 0.5 we call the corresponding quantile the “median” of X and use for it,
usually, the symbol Md .
• It may be interesting to notice that, if G is continuous and increasing, we have
Md (G(X)) = G(Md (X)).
19.30
Mode
• A mode in a discrete probability distribution (or frequency distribution) is any
value of x ∈ Supp(X) where the probability (frequency) has a local maximum.
• “The mode”, usually, is the global maximum.
• In the case of densities, the same definition is applied in terms of density instead
of probability (frequency).
19.31
Univariate Distributions Models
• Models for univariate distributions come in two kinds: non parametric and parametric.
• A parametric model is a family of functions indexed by a finite set of parameters (real numbers) and such that for any value of the parameters in a predefined
parameter space the functions are probability densities (continuous case) or probability functions (discrete case).
• A non parametric model is a model where the family of distributions cannot be indexed by a finite set of real numbers.
• It should be noticed that, in many applications, we are not interested in a full
model of the distribution but in modeling only an aspect of it as, for instance,
the expected value, the variance, some quantile and so on.
19.32
Some Univariate Discrete Distributions
• Bernoulli: P (x) = θ, x = 1; P (x) = 1 − θ, x = 0; 0 ≤ θ ≤ 1. You should
notice the convention: the function is explicitly defined only on the support of
the random variable. For the Bernoulli we have: E(X) = θ, V (X) = θ(1 − θ).
• Binomial: P(x) = C(n, x) θ^x (1 − θ)^{n−x}, x = 0, 1, 2, ..., n; 0 ≤ θ ≤ 1, where C(n, x) is the binomial coefficient. We have: E(X) = nθ; V(X) = nθ(1 − θ).
• Poisson: P(x) = λ^x e^{−λ}/x!, x = 0, 1, 2, ...; 0 ≤ λ. We have: E(X) = λ; V(X) = λ.
• Geometric: P(x) = (1 − θ)^{x−1} θ, x = 1, 2, ...; 0 < θ ≤ 1. We have E(X) = 1/θ; V(X) = (1 − θ)/θ².
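These moment formulas can be checked numerically by truncating the defining series at a large cutoff; the sketch below does this for the geometric distribution with an illustrative θ = 0.3:

```python
# Numerical check of the geometric mean and variance formulas by
# truncating the series at a large cutoff (theta = 0.3 is illustrative).
theta = 0.3
cutoff = 2000  # the neglected tail is astronomically small here

xs = range(1, cutoff + 1)
probs = [(1 - theta) ** (x - 1) * theta for x in xs]
mean = sum(x * p for x, p in zip(xs, probs))
second = sum(x * x * p for x, p in zip(xs, probs))
var = second - mean ** 2

print(mean, var)  # should match 1/theta and (1 - theta)/theta**2
```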
19.33
Some Univariate Continuous Distributions
Negative exponential: f(x) = θe^{−θx}, x > 0, θ > 0. We have: E(X) = 1/θ; V(X) = 1/θ². (Here you should notice that, as is often the case for distributions with constrained support, the variance and the expected value are functionally related.)
19.34
Some Univariate Continuous Distributions
Gaussian: f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}, x ∈ R, µ ∈ R, σ² > 0. We have E(X) = µ, V(X) = σ². A very important property of this random variable is that, if a and b are constants, then Y = aX + b is a Gaussian if X is a Gaussian.
By the above recalled rules on the E and V operators we also have E(Y) = aµ + b; V(Y) = a²σ². In particular, the transform Z = (X − µ)/σ is distributed as a “standard” (expected value 0, variance 1) Gaussian.
19.35
Some Univariate Continuous Distributions
The distribution function of a standard Gaussian random variable is usually indicated with Φ, so Φ(x) is the probability of observing values of the random variable X which are smaller than or equal to the number x, in short: Φ(x) = P(X ≤ x). With z_{1−α} = Φ^{−1}(1 − α) we indicate the inverse function of Φ, that is: the value of the standard Gaussian which leaves on its left a given amount 1 − α of probability. Obviously Φ(Φ^{−1}(1 − α)) = 1 − α.
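Φ has no closed form, but it can be computed from the error function available in the Python standard library; the following sketch also checks the 2.576 value quoted in the Tchebicev section:

```python
# The standard Gaussian PDF (CDF) computed from the error function in the
# Python standard library; no statistics package is assumed.
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# check against the value quoted for the 99% interval: Phi(2.576) ~ 0.995
print(Phi(0.0), Phi(2.576))
```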
19.36
Random Vector
• A random vector X of size n is an n-dimensional vector function from Ω to R^n, that is: a function which assigns to each ω ∈ Ω a vector of n real numbers.
• The name “random vector” is better than the name “vector of random variables” in
that, while each element of a random vector is, in fact, a random variable, a simple
vector of random variables could fail to be a random vector if the arguments ωi
of the different random variables are not constrained to always coincide.
• (If you understand this apparently useless subtlety you are well on your road to
understanding random vectors, random sequences and stochastic processes).
19.37
Distribution Function for a Random Vector
• Notions of measurability analogous to the one dimensional case are required of random vectors, but we do not mention these here.
• Just as in the case of a random variable, we can define the probability distribution function of a random vector as F_X(t_1, t_2, ..., t_n) = P({ω : X_1(ω) ≤ t_1, X_2(ω) ≤ t_2, ..., X_n(ω) ≤ t_n}), where the commas in this formula can be read as logical “and” and, please, notice again that the ω for each element of the vector is always the same.
19.38
Density and Probability Function
As well as in the one dimensional case, we usually do not model a random vector by
specifying its probability distribution function but its probability function: P (x1 , ..., xn )
or its density: f (x1 , ..., xn ), depending on the case.
19.39
Marginal Distributions
• In the case of random vectors we may be interested in “marginal” distributions,
that is: probability or density functions of a subset of the original elements in
the vector.
• If we wish to find the distribution of all the elements of the vector minus, say,
the i-th element we simply work like this:
• in the discrete case:

P(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) = Σ_{x_i ∈ Supp(X_i)} P(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n)

• and in the continuous case:

f(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) = ∫_{x_i ∈ Supp(X_i)} f(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) dx_i
• We iterate the same procedures for finding other marginal distributions.
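A minimal sketch of marginalization in the discrete case (the joint probability table is invented for the example):

```python
# Sketch of marginalization for a discrete bivariate distribution.
# The joint probability table is invented for the example.
joint = {(1, 1): 0.10, (1, 2): 0.25, (2, 1): 0.30, (2, 2): 0.35}

marginal_x = {}
for (x, y), p in joint.items():
    # sum the joint probability over the values of the dropped variable y
    marginal_x[x] = marginal_x.get(x, 0.0) + p

print(marginal_x)  # P(X=1) = 0.35, P(X=2) = 0.65
```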
19.40
Conditioning
• Conditional probability functions and conditional densities are defined just like
conditional probabilities for events.
• Obviously, the definition should be justified in a rigorous way but this is not
necessary, for now!
• The conditional probability function of, say, the first i elements in a random
vector given, say, the other n − i elements shall be defined as:
P(x_1, ..., x_i | x_{i+1}, ..., x_n) = P(x_1, ..., x_n) / P(x_{i+1}, ..., x_n)

• For the conditional density we have:

f(x_1, ..., x_i | x_{i+1}, ..., x_n) = f(x_1, ..., x_n) / f(x_{i+1}, ..., x_n)
• In both formulas we suppose denominators to be non zero.
19.41
Stochastic Independence
• Two sub vectors of a random vector, say: the first i and the other n − i random
variables, are said to be stochastically independent if the joint distribution is the
same as the product of the marginals or, that is the same under our definition, if
the conditional and marginal distribution coincide.
• We write this for the density case (for the probability function it is the same):

f(x_1, ..., x_n) = f(x_1, ..., x_i) f(x_{i+1}, ..., x_n)
f(x_1, ..., x_i | x_{i+1}, ..., x_n) = f(x_1, ..., x_i)
• This must be true for all the possible values of the n elements of the vector.
19.42
Mutual Independence
• A relevant particular case is that of a vector of mutually independent (or simply independent) random variables. In this case:

f(x_1, ..., x_n) = Π_{i=1,...,n} f_{X_i}(x_i)

• Again, this must be true for all possible (x_1, ..., x_n). (Notice the subscript added to the one dimensional density to distinguish among the variables, and the lowercase x_i which are possible values of the variables.)
19.43
Conditional Expectation
• Given a conditional probability function P (x1 , ..., xi |xi+1 , ...xn ) or a conditional
density f (x1 , ..., xi |xi+1 , ...xn ) we can define conditional expected values of, in
general, vector valued functions of the conditioned random variables.
• Something like E(g(x_1, ..., x_i)|x_{i+1}, ..., x_n) (the expected value is defined exactly as in the one dimensional case by a proper sum/series or integral operator).
19.44
Conditional Expectation
• It is to be understood that such an expected value is a function of the conditioning variables. If we understand this, it should not be a surprise that we can take the expected value of a conditional expected value. In this case the following property is of paramount relevance:
E(E(g(x1 , ..., xi )|xi+1 , ...xn )) = E(g(x1 , ..., xi ))
• Where, in order to understand the formula, we must remember that the outer
expected value in the left hand side of the identity is with respect to (wrt) the
marginal distribution of the conditioning variables vector: (xi+1 , ...xn ), while the
inner expected value of the same side of the identity is wrt the conditional distribution. Notice that, in general, this inner expected value: E(g(x1 , ..., xi )|xi+1 , ...xn )
is a function of the conditioning variables (the conditioned variables are “integrated out” in the operation of taking the conditional expectation) so that it is
meaningful to take its expected value with respect to the conditioning variables.
• The expected value on the right hand side, however, is with respect to the
marginal distribution of the conditioned variables (x1 , ..., xi ).
19.45
Conditional Expectation
• To be really precise we must say that the notation we use (small printed letters
for both the values and the names of the random variables) is approximate: we
should use capital letters for variables and small letters for values. However we
follow the practice that usually leaves the distinction to the discerning reader.
19.46
Law of Iterated Expectations
• The above property is called “law of iterated expectations” and can be written in
much more general ways.
• In the simplest case of two vectors we have: EY (EX|Y (X|Y)) = EX (X). For the
conditional expectation value, wrt the conditioned vector, all the properties of
the marginal expectation hold.
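The law is easy to verify on a small discrete joint distribution (the probability table below is invented for the illustration):

```python
# Check of the law of iterated expectations on a small discrete joint
# distribution; the probability table is invented for the illustration.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # P(X=x, Y=y)

p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

def cond_exp_x(y):
    # E(X | Y=y) = sum_x x * P(X=x, Y=y) / P(Y=y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

outer = sum(cond_exp_x(y) * p_y[y] for y in (0, 1))  # E_Y( E(X|Y) )
e_x = sum(x * p for (x, y), p in joint.items())      # E(X) from the joint

print(outer, e_x)  # the two numbers coincide
```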
19.47
Regressive Dependence
• Regression function and regressive dependence.
• Being a function of Y, the conditional expectation EX|Y (X|Y) is also called
“regression function” of X on Y. Analogously, EY|X (Y|X) is the regression function of Y on X. If EX|Y (X|Y) is constant wrt Y we say that X is regressively
independent on Y.
• If EY|X (Y|X) is independent of X we say that Y is regressively independent on
X.
• Regressive dependence/independence is not a symmetric concept: it can hold on
a side only.
• Moreover, stochastic independence implies two sided regressive independence; again, the converse is not true.
• A tricky topic: conditional expectation is, in general, a “static” concept. For any GIVEN set of values of, say, Y you compute E_{X|Y}(X|Y). However, implicitly, the term “regression function” implies the possibility of varying the values of the conditioning vector (or variable). This must be taken with the utmost care as it is at the origin of many misunderstandings, in particular with regard to “causal interpretations” of conditional expectations. The best, if approximate, idea to start with is that E_{X|Y}(X|Y) gives us a “catalog” of expected values, each valid under given “conditions” Y, whether or not it is possible or meaningful to “pass” from one set of values of Y to another.
19.48
Covariance and Correlation
• The covariance between two random variablesX and Y is defined as: Cov(X, Y) =
E(XY) − E(X)E(Y).
• From the above definition we get that, for any set of constants a, b, c, d: Cov(a + bX, c + dY) = bd Cov(X, Y).
• An important result (Cauchy inequality) allows us to show that |Cov(X, Y)| ≤ √(V(X)V(Y)). From this we derive a “standardized covariance” called “correlation coefficient”: Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)).
• We have Cor(a + bX, c + dY) = Sign(bd)Cor(X, Y).
• The square of the correlation coefficient is usually called R square or rho square.
• Notice that regressive independence, even only unilateral, implies zero covariance and zero correlation; the converse, however, is in general not true.
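The definitions and the affine-invariance property can be checked directly from their formulas; in this sketch the data points (taken with equal weights) are invented for the example:

```python
# Covariance and correlation computed from their definitions on invented
# data points taken with equal weights, plus a check of the property
# Cor(a + bX, c + dY) = Sign(bd) Cor(X, Y).
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # Cov(U, V) = E(UV) - E(U)E(V)
    return mean([a * b for a, b in zip(u, v)]) - mean(u) * mean(v)

def cor(u, v):
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))

r = cor(x, y)
r2 = cor([1 - 2 * a for a in x], [3 + 4 * b for b in y])  # b = -2, d = 4

print(r, r2)  # r2 equals -r since Sign(bd) = -1
```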
19.49
Distribution of the max and the min for independent
random variables
• Let {X1 , ..., Xn } be independent random variables with distribution functions
FXi (.).
• Let X(1) = max{X1 , ..., Xn } and X(n) = min{X1 , ..., Xn }.
• Then F_{X_(1)}(t) = Π_{i=1}^n F_{X_i}(t) and F_{X_(n)}(t) = 1 − Π_{i=1}^n (1 − F_{X_i}(t)).
• If the random variables are also identically distributed we have:

F_{X_(1)}(t) = Π_{i=1}^n F_{X_i}(t) = F^n(t)

and

F_{X_(n)}(t) = 1 − Π_{i=1}^n (1 − F_{X_i}(t)) = 1 − (1 − F(t))^n

19.50

Distribution of the max and the min for independent random variables
• Why? Consider the case of the max. F_{X_(1)}(t) is, by definition, the probability that the value of the max among the n random variables is less than or equal to t.
• But the max is less than or equal to t if and only if each random variable is less than or equal to t.
• Since they are independent, this is given by the product of the F_{X_i}, each computed at the same point t, that is F_{X_(1)}(t) = Π_{i=1}^n F_{X_i}(t).
• For the min: 1 − F_{X_(n)}(t) is the probability that the min is greater than t. But this is true if and only if each of the n random variables has a value greater than t, and for each random variable this probability is 1 − F_{X_i}(t). They are independent, so...
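A quick simulation check of the iid case, using Uniform(0,1) variables, for which F(t) = t on (0, 1):

```python
# Simulation check of F_max(t) = F(t)^n for n iid Uniform(0,1) variables,
# for which F(t) = t on (0, 1).
import random

random.seed(1)
n, reps, t = 5, 200_000, 0.7

hits = sum(
    max(random.random() for _ in range(n)) <= t for _ in range(reps)
)
empirical = hits / reps
theoretical = t ** n

print(empirical, theoretical)  # the two values are close
```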
19.51
Distribution of the sum of independent random variables
and central limit theorem
• Let {X_1, ..., X_n} be independent random variables. Let S_n = Σ_{i=1}^n X_i be their sum.
• We know that, if each random variable has expected value µ_i and variance σ_i², then E(S_n) = Σ_{i=1}^n µ_i and V(S_n) = Σ_{i=1}^n σ_i².
• To be more precise: the first property is always valid, whatever the dependence, provided the expected values exist, while the second only requires zero correlation (provided the variances exist).
• Can we say something about the distribution of Sn ?
• If we knew the distributions of the Xi we could (but this could be quite cumbersome) compute the distribution of the sum.
• However, if we do not know (better: do not make hypotheses on) the distributions of the X_i, we can still prove a powerful and famous result which, in its simplest form, states:
19.52
Distribution of the sum of independent random variables
and central limit theorem
• Let {X_1, ..., X_n} be iid random variables with expected value µ and variance σ². Then:

lim_{n→∞} Pr( (S_n/n − µ) / (σ/√n) ≤ t ) = Φ(t)

where, as specified above, Φ(.) is the PDF of a standard Gaussian.
• In practice this means that, under the hypotheses of this theorem, if “n is big enough” (a sentence whose meaning should be, and can be, made precise) we can approximate F_{S_n}(s) with Φ((s/n − µ)/(σ/√n)).
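A simulation sketch of the theorem: standardized means of n = 50 iid uniforms are compared with Φ at the point t = 0 (where, by the symmetry of the uniform, the exact probability is 1/2):

```python
# Simulation sketch of the central limit theorem: standardized means of
# n iid Uniform(0,1) draws compared with Phi at t = 0.
import math
import random

random.seed(2)
n, reps = 50, 100_000
mu = 0.5                    # mean of Uniform(0,1)
sigma = math.sqrt(1 / 12)   # standard deviation of Uniform(0,1)

count = 0
for _ in range(reps):
    s = sum(random.random() for _ in range(n))
    z = (s / n - mu) / (sigma / math.sqrt(n))  # standardized mean
    if z <= 0:
        count += 1

empirical = count / reps
print(empirical)  # close to Phi(0) = 0.5
```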
n
19.53
Distribution of the sum of independent random variables
and central limit theorem
• More general versions of this theorem, with non necessarily identically distributed
or even non independent Xi exist.
• This result is fundamental in statistical applications where confidence levels for confidence intervals or error sizes for tests must be computed in non standard settings.
Statistical inference
19.54
Why Statistics
• Probabilities are useful when we can specify their values. As we saw above, sometimes, in finite settings (coin flipping, dice rolling, card games, roulette, etc.), it is possible to reduce all probability statements to simple statements judged, by symmetry properties, equiprobable.
• In these cases we say we “know” probabilities (at least in the sense that we agree on their values and, as a first approximation, do not look for some “discovery rule” for probabilities) and use these for making decisions (meaning: betting). In other circumstances we are not so lucky.
• This is obvious when we consider betting on horse racing, computing insurance
premia, investing in financial securities. In all these fields “symmetry” statements
are not reasonable.
• However, from the didactic point of view, it is useful to show that the ”problem”
is there even with simple physical “randomizing devices” when their “shape” does
not allow for simple symmetry statements.
• Consider for instance rolling a pyramidal “die”: this is a five sided object with four triangular sides and one square side. In this case, what is the probability for each single side to be the down side? For some news on dice see http://en.wikipedia.org/wiki/Dice
19.55
Unknown Probabilities and Symmetry
• The sides are not identical, so the classical argument for equiprobability does not hold. We may agree that the probability of each triangular face is the same, as the die is clearly symmetric if seen with the square side down. But then: what is the total value of these four probabilities? Or, which is the same, what is the probability for the square face to be the down one?
• Just by observing different pyramidal dice we could surmise that the relative probability of the square face and of the four triangular faces depends, also, on the effective shape of the triangular faces. We could hypothesize, perhaps, that the greater the height of such faces, the bigger the probability for a triangular face to be the down one in comparison to the probability for the square face.
19.56
Unknown Probabilities and Symmetry
• With skillful physical arguments we could come up with some quantitative hypotheses; we understand, however, that this shall not be simple. In all likelihood a direct observation of the results from a series of actual rolls of this die could be very useful.
• For instance we could observe, not simply hypothesize, that (for a pyramid made
of some homogeneous substance) the more peaked are the triangular sides (and
so the bigger their area for a given square basis of the pyramid) the smaller the
probability for the square side to be the one down after the throw. We could also
observe, directly or by mind experiment, that the “degenerate” pyramid having
height equal to zero is, essentially, a square coin so that the probability of each side
(the square one and the one which shall transform in the four triangles) should
be 1/2. From these two observations and some continuity argument we could
conclude that there should be some unknown height such that the probability of
falling on a triangular side is, say, 1 > c ≥ 1/2 and, by symmetry, the probability
for each triangular side, is c/4. This provided there is no cheating in throwing
the die so that the throw is “chaotic” enough. So, beware of magicians!
• What is interesting is that this “mental+empirical” analysis gives us a possible
probability model for the result of throwing our pyramidal die. Moreover, this
model could be enriched by some law connecting c with the height of the pyramid.
Is c proportional to the height? Proportional to the square of the height? To the
square root of the height? As we shall see in what follows Statistics could be a
tool for choosing among these hypotheses.
• By converse, suppose you know, from previous analysis, that, for a pyramid made
of homogeneous material, a good approximation is c proportional to the height.
In this case a good test to assess the homogeneity of the material with which the
pyramid is made could be that of throwing several pyramids of different height
and see if the ratio between the frequency of a triangular face and the height of
the pyramid is a constant.
19.57
No Symmetry
• Consider now a different example: horse racing. Here the event whose probability
we are interested in is, to be simple, the name of the winner.
• It is “clear” that symmetry arguments here are useless. Moreover, in this case even the use of past data cannot mimic the case of the pyramid: while the observation of past race results could be relevant, the idea of repeating the same race a number of times in order to derive some numerical evaluation of probability is both unpractical and, perhaps, even irrelevant.
19.58
No Symmetry
• What we may deem useful are data on past races of the contenders, but these
data regard different track conditions, different tracks and different opponents.
• Moreover they regard different times, hence, a different age of the horse(s), a
different period in the years, a different level of training, and so on.
• History, in short.
• This notwithstanding, people have bet, and bet hard, on such events since time immemorial. Where do their probabilities come from?
• An interesting point to be made is that, in antiquity, while betting was even more common than it is today (in many cultures it had a religious content: looking for the favor of the gods), betting tools like dice existed in a very rudimentary form with respect to today. We know examples of fantastically “good” dice made of glass or amber (many of these not used for actual gambling but as offerings to the Deity). These are very rare. The most commonly used die came from a roughly cubic bone of a goat or a sheep. In this case symmetry arguments were impossible and experience could be useful.
• An interesting anthropological fact is that in classical times gambling was very common, and the concepts of chance and luck were so widespread as to merit specific deities. However no hint of any kind of “uncertainty quantification” is known, with the exception of some side comments. Why this is the case is a mystery. It may be that the religious content mentioned above made the idea of quantifying chance in some sense blasphemous, but this is only an hypothesis.
19.59
Learning Probabilities
• Let us sum up: probability is useful for taking decisions (betting) when the only unknown is the result of the game.
• This is the typical case in simple games of chance (not in the, albeit still simple,
pyramidal dice case).
• If we want to use probability when numerical values for probabilities are not easily derived, we are going to be uncertain both about the results and about the probability of such results.
• We can do nothing (legal) about the results of the game, but we may do something to build some reasonable way of assessing probabilities. In a nutshell, this is the purpose of Statistics.
• The basic idea of Statistics is that, in some cases, we can “learn” probabilities from repeated observations of the phenomena we are interested in.
• The problem is that for “learning” probabilities we need ... probabilities!
19.60
Pyramidal Die
• Let us work at an intuitive level on a specific problem. Consider this set of basic
assumptions concerning the pyramidal die problem.
• We may agree that the probability for each face to be the down one in repeated
rollings of the die is constant, unknown but constant.
• Moreover, we may accept that the order in which results are recorded is, for us, irrelevant, as the “experiments” (rolls of the die) are always made in the same conditions.
• We, perhaps, shall also agree that the probability of each triangular face is the
same.
19.61
Pyramidal Die Model
• Well: we now have a “statistical model”. Let us call θi , i = 1, 2, 3, 4 the probabilities of each triangular face.
• These are going to be non negative numbers (Probability Theory requires this); moreover, if we agree with the statement about their identity, each of these values must be equal to the same θ, so the total probability for a triangular face to be the down one shall be 4θ.
• By the rules of probability, the probability for the square face is going to be 1 − 4θ and, since this cannot be negative, we need θ ≤ .25 (where we perhaps shall avoid the equality in the ≤ sign).
• If we recall the previous analysis we should also require θ ≥ 1/8.
19.62
Pyramidal Die Constraints
• All these statements come from Probability Theory joint with our assumptions
on the phenomenon we are observing.
• In other, more formal, words we specify a probability model for each roll of the
die and state this:
• In each roll we can have a result in the range 1,2,3,4,5;
• The probability of each of the first four values is θ and this must be a number
not greater than .25.
• With just these words we have hypothesized that the probability distribution of
each result in a single toss is an element of a simple but infinite and very specific
set of probability distributions completely characterized by the numerical value
of the “parameter” θ which could be any number in the “parameter space” given
by the real numbers between 1/8 and 1/4 (left extreme included if you like).
19.63
Many Rolls
• This is a model for a single roll. But, exploiting our hypotheses, we can easily go on to a model for any set of rolls of the die.
• In fact, we supposed that each sequence of results of a given length has a probability which only depends on the number of triangular and square faces observed in the series (in technical terms we say that the observation process produces an “exchangeable” sequence of results, that is: sequences of results containing the same number of 5s and non-5s have the same probability).
• Just for simplicity in computation let us move one step further: we shall strengthen our hypothesis and actually state that the results of different rolls are stochastically independent (this is a particular case of exchangeability, that is: it implies but is not implied by exchangeability).
19.64
Probability of Observing a Sample
• Under this hypothesis and the previously stated probability model for each single roll, the joint probability of a sample of size n, where we only record 5s and non-5s, is just the product of the probabilities for each observation.
• In our example: suppose we roll the die 100 times and observe 40 times a 5 (square face down) and 60 times either 1, 2, 3 or 4; since each of these faces is incompatible with the others and each has probability θ, the probability of “either 1 or 2 or 3 or 4” is 4θ.
• The joint probability of the observed sample is thus (4θ)^60 (1 − 4θ)^40.
19.65
Pre or Post Observation?
But here there is a catch, and we must understand this well: are we computing the
probability of a possible sample before observation, or the probability of the observed
sample? In the first case no problems, the answer is correct, but, in the second, we
must realize that the probability of observing the observed sample is actually one, after
all we DID observe it!
• Let us forget, for the moment, this subtlety, which is going to be relevant in what follows. We have the probability of the observed sample; since the sample is given, the only thing in the formula which can change value is the parameter θ.
• The probability of observing the given sample shall, in general, be a function of
this parameter.
19.66
Maximize the Probability of the Observed Sample
• The value which maximizes the probability of the observed sample among the possible values of θ is (check it!): θ̂ = 60/400 = 3/20 = .15
• Notice that this value maximizes (4θ)^60 (1 − 4θ)^40, that is: the probability of observing the given sample (or any specific sample containing 40 5s and 60 non-5s), but it also maximizes (100 choose 40) (4θ)^60 (1 − 4θ)^40, that is: the probability of observing A sample in the set of samples containing 40 5s and 60 non-5s. (Be careful in understanding the difference between “the given sample” and “A sample in the set”; moreover notice that (100 choose 40) = (100 choose 60).)
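Since these notes contain no code, here is only a sketch (in Python) of the maximization; the counts 60 and 40 and the admissible range [1/8, 1/4] for θ come from the example above:

```python
import numpy as np

# Likelihood of the observed sample: 60 triangular faces ("non-5s") and
# 40 square faces ("5s"), as a function of theta in [1/8, 1/4].
def likelihood(theta, n_tri=60, n_sq=40):
    return (4 * theta) ** n_tri * (1 - 4 * theta) ** n_sq

# Brute-force maximization over a fine grid of admissible theta values.
grid = np.linspace(1 / 8, 1 / 4, 100001)
theta_hat = grid[np.argmax(likelihood(grid))]

print(theta_hat)   # close to the closed-form value 60/400 = 0.15
```

The grid search stands in for the analytic first-order condition: differentiating the log of the likelihood and setting it to zero gives exactly 3/20.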
19.67
Maximum Likelihood
• Stop for a moment and fix some points. What did we do, after all? Our problem was to find a probability for each face of the pyramidal die. The only thing we could say a priori was that the probability of each triangular face was the same.
From this and simple probability rules we derived a probability model for the
random variable X whose values are 1, 2, 3, 4 when the down face is triangular,
and 5 when it is square.
• We then added an assumption on the sampling process: observations are iid
(independent and identically distributed as X). The two assumptions constitute
a “statistical model” for X and are enough for deriving a strategy for “estimating”
θ (the probability of any given triangular face).
• The suggested estimate is the value θb which maximizes the joint probability
of observing the sample actually observed. In other words we estimated the
unknown parameter according to the maximum likelihood method.
19.68
Sampling Variability
• At this point we have an estimate of θ and the first important point is to understand that this actually is just an estimate, it is not to be taken as the “true”
value of θ.
• In fact, if we roll the die another 100 times and compute the estimate with the same procedure, most likely a different estimate shall come up; and for another sample, another one, and so on and on.
• Statisticians do not only find estimates; most importantly, they study the worst enemy of anyone who must decide under uncertainty and unknown probabilities: sampling variability.
19.69
Possibly Different Samples
• The point is simple: consider all possible different samples of size 100. Since, as
we assumed before, the specific value of a non 5 is irrelevant, let us suppose, for
simplicity, that all that is recorded in a sample is a sequence of 5s and non 5s.
• Since in each roll we either get a 5 or a non-5, the total number of these possible samples is 2^100.
• On each of these samples our estimate could take a different value; consider, however, that the value of the estimate only depends on how many 5s and non-5s were observed in the specific sample (the estimate is the number of non-5s divided by 4 times 100, that is, by 400).
• So the probability of observing a given value of the estimate is the same as the
probability of the set of samples with the corresponding number of 5s.
19.70
The Probability of Our Sample
• But it is easy to compute this probability: since, by our assumptions on the statistical model, every sample containing the same number of 5s (and so of non-5s) has the same probability, in order to find this probability we can simply compute the probability of a generic sample of this kind and multiply it by the number of possible samples with the same number of 5s.
• If the number of 5s is, say, k we find that the probability of the generic sample with k 5s and 100 − k non-5s is (see above): (4θ)^(100−k) (1 − 4θ)^k.
19.71
The Probability of a Similar Estimate
• This is the same for any sample with k 5s and 100 − k non-5s. There are many samples of this kind, depending on the order of results. The number of possible samples of this kind can be computed in this simple way: we must put k 5s in a sequence of 100 possible places.
• We can insert the first 5 in any of 100 places, the second in any of 99, and so on.
• We get 100 · 99 · ... · (100 − k + 1) = 100!/(100 − k)!; however, there are k ways to choose which of the inserted 5s counts as the first, k − 1 for the second, and so on up to 1 for the k-th, and for all these k! orderings the sample is always the same, so the number of different samples is 100!/(k!(100 − k)!) = (100 choose k) (these numbers are called “combinations”).
19.72
The Probability of a Similar Estimate
• This is the number of different “strings” of 100 elements, each containing k 5s and 100 − k non-5s.
• Summing up: the probability of observing k 5s in 100 rolls, and hence of computing an estimate of θ equal to (100 − k)/400, is precisely: (100 choose k) (4θ)^(100−k) (1 − 4θ)^k (which is a trivial modification of the binomial).
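The whole sampling distribution of the estimate can be tabulated directly from this formula. A Python sketch (θ = 0.15 is only an illustrative “true” value):

```python
from math import comb

n, theta = 100, 0.15              # illustrative "true" parameter value
p5 = 1 - 4 * theta                # probability of the square face (a "5")

# P(k fives in n rolls) = C(n,k) (4 theta)^(n-k) (1 - 4 theta)^k,
# and k fives correspond to the estimate (n - k)/(4n).
probs = {(n - k) / (4 * n): comb(n, k) * (4 * theta) ** (n - k) * p5 ** k
         for k in range(n + 1)}

assert abs(sum(probs.values()) - 1) < 1e-9   # a genuine distribution
most_probable = max(probs, key=probs.get)
print(most_probable)                         # 0.15: the mode sits at theta itself
```

With these numbers the most probable estimate value equals θ itself, but many neighboring values carry non-negligible probability: that is sampling variability made explicit.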
19.73
The Probability of a Similar Estimate
• So, before sampling, for any possible “true” value of θ we have a different probability for each of the (101, in this case: k = 0, 1, ..., 100) possible values of the estimate.
• The reader shall realize that, for each given value of θ, the a priori (of sampling) most probable value of the estimate is the one corresponding to the integer number of 5s nearest to 100(1 − 4θ) (which in general shall not be an integer).
19.74
The Estimate in Other Possible Samples
• Obviously, since this is just the most probable value of the estimate if the probability is computed with this θ, it is quite possible, in fact very likely, that a different sample is observed.
• Since our procedure is to estimate θ with (100 − k)/400, this immediately implies that, in case the observed sample is not the most probable for that given θ, the value of the estimate shall NOT be equal to θ; in other words it shall be “wrong”, and the reason for this is the possibility of observing many different samples for each given “true” θ, that is: sampling variability.
• In general, using the results above, for any given θ, the probability of observing a sample of size n which gives as an estimate (n − k)/(4n) is (as above) (n choose k) (4θ)^(n−k) (1 − 4θ)^k.
19.75
The Estimate in Other Possible Samples
• So, for instance, the probability, given this value of θ, of observing a sample such that the estimate (n − k)/(4n) is equal to the parameter value is, if we suppose that the value of θ which we use in computing this probability can be written as (n − k)/(4n) (otherwise the probability is 0 and we must use intervals of values):

(n choose k) (4 · (n − k)/(4n))^(n−k) (1 − 4 · (n − k)/(4n))^k = (n choose k) ((n − k)/n)^(n−k) (1 − (n − k)/n)^k

• Due to what we did see above, the value (n − k)/(4n) is the most probable value of the estimate when θ = (n − k)/(4n), but many other values may have sizable probability, so that, even if the “true value” is θ = (n − k)/(4n), it is possible to observe estimates different from (n − k)/(4n) with non-negligible probability.
19.76
Sampling Variability
• The study of the distribution of the estimate given θ is called the study of the “sampling variability” of the estimate: the tendency of the estimate to change across different samples. This study can be done in several different ways.
• For instance, using again our example, we see clearly that there does not exist a
single “sampling distribution” of the estimate as there is one for each value of the
parameter.
• On one hand this is good, because otherwise the estimate would give us quite
poor information about θ: the information we get from the estimate comes exactly
from the fact that for different values of θ different values of the estimate are more
likely to be observed.
• On the other hand, it does not allow us to say which is the “sampling distribution” of the estimate, but only gives us a family of such distributions.
19.77
Sampling Variability
• However, even if we do not know the value of the parameter we may study several
aspects of the sampling distribution.
• For instance, for each θ we can compute, given that θ, the expected value of the estimate under the distribution of the estimate with that particular value of θ. In other words we could compute

Σ_{k=0,...,n} (n − k)/(4n) · (n choose k) (4θ)^(n−k) (1 − 4θ)^k

and by doing this computation we would see that the result is θ itself, no matter which value θ has. So we say that the estimate is unbiased.
19.78
Sampling Variability
• Again, for each θ we can compute the variance of the estimate under the distribution of the estimate with that particular value of θ. That is, we could compute

Σ_{k=0,...,n} ((n − k)/(4n))² · (n choose k) (4θ)^(n−k) (1 − 4θ)^k − θ² = 4θ(1 − 4θ)/(16n)

the “sampling variance” of the estimate, and see that, while this is a function of θ (whose value is unknown to us), for any value of θ it goes to 0 when n goes to infinity. This, joint with the above unbiasedness result, implies (Tchebicev inequality) that the probability of having

(n − k)/(4n) ∈ [θ ± c]

that is: of observing a value of the estimate different from θ by at most c, goes to 1 for ANY c > 0, no matter the value of θ. This is called “mean square consistency”.
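Both claims can be verified numerically: the number of non-5s is binomial with success probability 4θ, so the estimate has mean θ and variance n · 4θ(1 − 4θ)/(4n)² = 4θ(1 − 4θ)/(16n). A Python sketch with illustrative values of n and θ:

```python
from math import comb

def moments(n, theta):
    """Exact mean and variance of the estimate (n-k)/(4n), computed by
    summing over the binomial sampling distribution of k (number of 5s)."""
    p5 = 1 - 4 * theta
    mean = second = 0.0
    for k in range(n + 1):
        pk = comb(n, k) * (4 * theta) ** (n - k) * p5 ** k
        est = (n - k) / (4 * n)
        mean += est * pk
        second += est ** 2 * pk
    return mean, second - mean ** 2

theta = 0.2
for n in (10, 100, 1000):
    m, v = moments(n, theta)
    assert abs(m - theta) < 1e-9                             # unbiased for every n
    assert abs(v - 4*theta*(1 - 4*theta) / (16*n)) < 1e-9    # variance, shrinks like 1/n
```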
19.79
Sampling Variability
• A curiosity. In typical applications the sampling variance depends on the unknown parameter(s).
• While any reasonable estimate must have a sampling distribution depending on the unknown parameter(s), there are cases where the sampling variance could be independent of the unknown parameter(s).
• For instance, in iid sampling from an unknown distribution with unknown expected value µ and known standard deviation σ, the usual estimate of µ, the arithmetic mean of the data, has a sampling variance equal to σ²/n, which does not depend on unknown parameters (repeat: we assumed σ known).
19.80
Estimated Sampling Variability
• In the end, if, say, we wish for some “number” for the sampling variance when, as in our case, it depends on the unknown parameter, and not just the formula 4θ(1 − 4θ)/(16n), or for some specific distribution in the place of the family of distributions (n choose k)(4θ)^(n−k)(1 − 4θ)^k, we could “estimate” these by substituting the estimate θ̂ = (n − k)/(4n) for the unknown value of θ in the formulas.
• We get V̂(θ̂) = 4θ̂(1 − 4θ̂)/(16n) and P̂(θ̂ = (n − k)/(4n)) = (n choose k)(4θ̂)^(n−k)(1 − 4θ̂)^k, and always remember to notice the “hats” on V and P.
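The plug-in step is mechanical. A Python sketch, reusing the 40-fives/60-non-fives counts of the running example and the binomial-proportion variance 4θ(1 − 4θ)/(16n):

```python
from math import sqrt

n, k = 100, 40                     # observed: 40 fives, 60 non-fives
theta_hat = (n - k) / (4 * n)      # point estimate, here 60/400 = 0.15

# Substitute the estimate for the unknown theta in the variance formula:
var_hat = 4 * theta_hat * (1 - 4 * theta_hat) / (16 * n)
se_hat = sqrt(var_hat)             # estimated sampling standard deviation

print(theta_hat, se_hat)           # note: se_hat is itself only an estimate
```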
19.81
Quantifying Sampling Variability
• Whatever method we use for dealing with sampling variability, the point is to face it.
• We could find different procedures for computing our estimate; however, for the same reason (for each given true value of θ many different samples are possible) any reasonable estimate always has a sampling distribution (in reasonable cases depending on θ), so we would in any case face the same problem: sampling variability.
• The point is not to avoid sampling variability but to live with it. In order to do
this it is better to follow some simple principles.
• Simple, yes, but so often forgotten, even by professionals, as to create most of the problems encountered in practical applications of Statistics.
19.82
Principle 1
• The first obvious principle to follow in order to be able to do this is: “do not
forget it”.
• An estimate is an estimate is an estimate, it is not the “true” θ.
• This seems obvious, but errors of this kind are quite common: it seems the human brain does not like uncertainty and, if not properly conditioned, it shall try, in any possible way, to make us wrongly believe that we are sure about something of which we only possess some clue.
19.83
Principle 2
• The second principle is “measure it”.
• An estimate (point estimate) by itself is almost completely useless, it should
always be supplemented with information about sampling variability.
• At the very least information about sampling standard deviation should be added.
Reporting in the form of confidence intervals could be quite useful.
• This and not point estimation is the most important contribution Statistics may
give to your decisions under uncertainty.
19.84
Principle 3
• The third principle is “do not be upset by it”.
• Results of decisions may upset you even under certainty. This is obviously much more likely when chance is present, even if probabilities are known.
• We are at the third level: no certainty, chance is present, probabilities are unknown!
• The best Statistics can only guarantee an efficient and logically coherent use of
available information.
• It does not guarantee Luck in “getting the right estimates” and obviously it cannot guarantee that, even if probabilities are estimated well, something very unlikely does not happen! (And, no matter what, people shall always expect, forgive the joke, that what is most probable is much more likely than it is probable.)
19.85
The Questions of Statistics
• This long discussion should be useful as an introduction to the statistical problem:
• why do we need to do inference instead of simply using Probability?
• what can we expect from inference?
• Now let us be a little more precise.
19.86
Statistical Model
• This is made of two ingredients.
• The first is a probability model for a random variable (or more generally a random
vector, but here we shall consider only the one dimensional case).
• This is simply a set of distributions (probability functions or densities) for the
random variable of interest. The set can be indexed by a finite set of numbers
(parameters) and in this case we speak of a parametric model. Otherwise we
speak of a non parametric model.
• The second ingredient is a sampling model, that is: a probabilistic assessment about the joint distribution of repeated observations of the variable of interest.
• The simplest example of this is the case of independent and identically distributed
observations (simple random sampling).
19.87
Specification of a Parametric Model
• Typically a parametric mode is specified by choosing some functional form for
the probability or density function (here we use the symbol P for both) of the
random variable X say: X
P (X; θ) and a set of possible values for θ : θ ∈ Θ(in
the case of a parametric model).
• Sometimes we do not fully specify P but simply ask, for instance, for Xto have
a certain expected value or a certain variance.
19.88
Statistic
• A fundamental concept is that of “estimate” or “statistic”. Given a sample X, an estimate is simply a function of the sample and nothing else: T(X).
• In other words, it cannot depend on unknowns such as the parameters of the model. Once the sample is observed, the estimate becomes a number.
19.89
Parametric Inference
• When we have a parametric model we typically speak about “parametric inference”, and we are going to do so here.
• This may give the false impression that statisticians are interested in parameter values.
• Sometimes this may be so but, really, statisticians are interested in assessing
probabilities for (future) values of X, parameters are just “middlemen” in this
endeavor.
19.90
Different Inferential Tools
• Traditionally parametric inference is divided into three (interconnected) sections:
• Point estimation;
• Interval estimation;
• Hypothesis testing.
19.91
Point Estimation
• In point estimation we try to find an estimate T (X) for the unknown parameter
θ (the case of a multidimensional parameter is completely analogous).
• In principle, any statistic could be an estimate, so we discriminate between good
and bad estimates by studying the sampling properties of these estimates.
• In other words we try to assess whether a given estimate’s sampling distribution (that is, as we did see before, the probability distribution of the possible values of the statistic as induced by the probabilities of the different possible samples) enjoys or not a set of properties we believe useful for a good estimate.
19.92
Unbiasedness
• An estimate T(X) is unbiased for θ iff E_θ(T(X)) = θ, ∀θ ∈ Θ. In order to understand the definition (and the concept of sampling distribution) it is important to realize that, in general, the statistic T has a potentially different expected value for each different value of θ (hence for each different distribution of the sample).
• What the definition asks for is that this expected value always corresponds to the θ which indexes the distribution used for computing the expected value itself.
19.93
Mean Square Error
• We define the mean square error of an estimate T as: MSE_θ(T) = E_θ((T − θ)²).
• Notice how, in this definition, we stress the point that the MSE is a function of θ (just like the expected value of T).
• We recall the simple result:

E_θ((T − θ)²) = E_θ((T − E_θ(T) + E_θ(T) − θ)²) = E_θ((T − E_θ(T))²) + (E_θ(T) − θ)²

where the first term in the sum is the sampling variance of the estimate and the second is the squared “bias”.
• Obviously, for an unbiased estimate, MSE and sampling variance are the same.
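The decomposition can be checked numerically on the die example, using a deliberately biased modification of the earlier estimate so that the bias term is visible (a Python sketch; the shrinkage 0.9 and shift 0.01 are arbitrary illustrative choices):

```python
from math import comb

n, theta = 50, 0.2
p5 = 1 - 4 * theta                       # probability of the square face

def pk(k):
    """P(k fives in n rolls) under the model of the die example."""
    return comb(n, k) * (4 * theta) ** (n - k) * p5 ** k

def t(k):
    """A deliberately biased estimate, so the bias term is nonzero."""
    return 0.9 * (n - k) / (4 * n) + 0.01

mean = sum(t(k) * pk(k) for k in range(n + 1))
var = sum((t(k) - mean) ** 2 * pk(k) for k in range(n + 1))
mse = sum((t(k) - theta) ** 2 * pk(k) for k in range(n + 1))
bias = mean - theta

assert abs(mse - (var + bias ** 2)) < 1e-12   # MSE = variance + bias^2
print(mse, var, bias ** 2)
```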
19.94
Mean Square Efficiency
• Suppose we are comparing two estimates for θ, say: T1 and T2 .
• We state that T1 is not less efficient than T2 if and only if MSE_θ(T1) ≤ MSE_θ(T2), ∀θ ∈ Θ.
• As is the case of unbiasedness the most important point is to notice the “for all”
quantifier (∀).
• This implies, for instance, that, given two estimates, it may be impossible to say whether one is not worse than the other under this definition.
• In fact it may well happen that mean square errors, as functions of the parameter
“cross”, so that one estimate is “better” for some set of parameter values while
the other for a different set.
• In other words, the order induced on estimates by this definition is only partial.
19.95
Meaning of Efficiency
If an estimate T1 satisfies this definition wrt another estimate T2, this means (use the Tchebicev inequality and the above decomposition of the mean square error) that it shall have a bigger (better: not smaller) probability than T2 of being “near” θ, for any value of this parameter.
19.96
Mean Square Consistency
• Here we introduce a variation. Up to now the properties considered only fixed sample sizes. Here, on the contrary, we consider the sample size n as a variable.
• Obviously, since an estimate is defined on a given sample, this new setting requires
the definition of a sequence of estimates and the property we are about to state
is not a property of an estimate but of a sequence of estimates.
19.97
Mean Square Consistency
• A sequence {T_n} of estimates is termed “mean square consistent” if and only if lim_{n→∞} MSE_θ(T_n) = 0, ∀θ ∈ Θ.
• You should notice again the quantifier on the values of the parameter.
• Given the above decomposition of the MSE, the property is equivalent to the joint requirement: lim_{n→∞} E_θ(T_n) = θ, ∀θ ∈ Θ, and lim_{n→∞} V_θ(T_n) = 0, ∀θ ∈ Θ.
• Again, using Tchebicev, we understand that the requirement implies that, for
any given value of the parameter, the probability of observing a value of the
estimate in any given interval containing θ goes to 1 if the size of the sample goes
to infinity.
19.98
Methods for Building Estimates
We could proceed by trial and error: this would be quite time consuming. Better to devise some “machinery” for creating estimates which we can reasonably expect to be “good” in at least some of the above defined senses.
19.99
Method of Moments
• Suppose we have an iid (to be simple) sample X from a random variable X distributed according to some (probability or density) P(X; θ), θ ∈ Θ, where the parameter is, in general, a vector of k components.
• Suppose, moreover, that X has got, say, its first M moments E(X^m), m = 1, ..., M.
• In general we shall have E(X^m) = g_m(θ), that is: the moments are functions of the unknown parameters.
19.100
Estimation of Moments
• Now, under iid sampling, it is very easy to estimate moments in a way that is, at least, unbiased and mean square consistent (and also, under proper hypotheses, efficient).
• In fact the estimate Ê(X^m) = Σ_{i=1,...,n} X_i^m / n, that is: the m-th empirical moment, is immediately seen to be unbiased, while its MSE (the variance, in this case) is V(X^m)/n, which (if it exists) obviously goes to 0 if the size n of the sample goes to infinity.
19.101
Inverting the Moment Equation
• The idea of the method of moments is simple. Suppose for the moment that θ is one dimensional.
• Choose any g_m and suppose it is invertible (if the model is sensible, this should be true. Why?).
• Estimate the corresponding moment of order m with the empirical moment of the same order and take as an estimate of θ the value θ̂_m = g_m^(−1)(Σ_{i=1,...,n} X_i^m / n).
• In the case of k parameters, just solve, with respect to the unknown parameter vector, a system of k equations connecting the parameter vector with k moments, each estimated by the corresponding empirical moment.
19.102
Problems
• This procedure is intuitively alluring. However, we have at least two problems. The first is that any different choice of moments is going to give us, in general, a different estimate (consider for instance the negative exponential model and estimate its parameter using different moments).
• The Generalized Method of Moments tries to solve this problem (do not worry!
this is something you may ignore, for the moment).
• The second is that, while empirical moments under iid sampling are, for instance,
unbiased estimates of corresponding theoretical moments, this is usually not true
for method of moments estimates. This is due to the fact that the gm we use are
typically not linear.
• Under suitable hypotheses we can show that method of moments estimates are mean square consistent, but this is usually all we can say.
19.103
Maximum Likelihood
• Maximum likelihood method (one of the many inventions of Sir R. A. Fisher: the
creator of modern mathematical Statistics and modern mathematical genetics).
• Here the idea is clear if we are in a discrete setting (i.e. if we consider a model
of a probability function).
• The first step in the maximum likelihood method is to build the joint distribution
of the sample.
• In the context described above (iid sample) we have P(X; θ) = Π_i P(X_i; θ).
• Now, observe the sample and change the random variables in this formula (the X_i) into the corresponding observations (the x_i).
• The resulting P(x; θ) cannot be seen as a probability of the sample (the probability of the observed sample is, obviously, 1), but can be seen as a function of θ given the observed sample: L_x(θ) = P(x; θ).
19.104
Maximum Likelihood
• We call this function the “likelihood”.
• It is by no means a probability, either of the sample or of θ, hence the new name.
• The maximum likelihood method suggests the choice, as an estimate of θ, of the value that maximizes the likelihood function given the observed sample; formally: θ̂_ml = arg max_{θ∈Θ} L_x(θ).
19.105
Interpretation
• If P is a probability (discrete case), the idea of the maximum likelihood method is that of finding the value of the parameter which maximizes the probability of the sample actually observed (a posteriori).
• The reasoning is exactly as in the example at the beginning of this section.
• While for each given value of the parameter we may observe, in general, many
different samples, a set of these (not necessarily just one single sample: many
different samples may have the same probability) has the maximum probability
of being observed given the value of the parameter.
19.106
Interpretation
• We observe the sample and do not know the parameter value so, as an estimate,
we choose that value for which the specific sample we observe is among the most
probable samples.
• Obviously, if, given the parameter value, the sample we observe is not among the most probable, we are going to make a mistake, but we hope this is not the most common case, and we can show, under proper hypotheses, that the probability of such a case goes to zero as the sample size increases to infinity.
19.107
Interpretation
• A more satisfactory interpretation of maximum likelihood in a particular case.
• Suppose the parameter θ has a finite set (say m) of possible values and suppose that, a priori of knowing the sample, the statistician considers the probability of each of these values to be the same (that is, 1/m).
• Using Bayes theorem, the posterior probability of a given value of the parameter given the observed sample shall be:

P(θ_j | x) = P(x | θ_j)(1/m) / Σ_j P(x | θ_j)(1/m) = h(x) L_x(θ_j).
19.108
Interpretation
• In words: if we consider the different values of the parameter a priori (of sample
observation) as equiprobable, then the likelihood function is proportional to the
posterior (given the sample) probability of the values of the parameter.
• So that, in this case, the maximum likelihood estimate is the same as the maximum posterior probability estimate.
• In this case, then, while the likelihood is not the probability of a parameter
value (it is proportional to it) to maximize the likelihood means to choose the
parameter value which has the maximum probability given the sample.
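This finite case is easy to sketch in Python (the five candidate values of θ and the observed counts, 60 non-5s and 40 5s, are illustrative, reusing the die example):

```python
# Finite parameter set with a uniform prior: the posterior is proportional
# to the likelihood, so their maximizers coincide.
thetas = [0.13, 0.15, 0.17, 0.19, 0.21]          # m = 5 candidate values
n_tri, n_sq = 60, 40                             # observed counts, as before

lik = [(4 * t) ** n_tri * (1 - 4 * t) ** n_sq for t in thetas]
posterior = [l / sum(lik) for l in lik]          # the uniform prior 1/m cancels

ml_value = thetas[lik.index(max(lik))]
map_value = thetas[posterior.index(max(posterior))]
assert ml_value == map_value                     # same maximizer
print(ml_value)
```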
19.109
Maximum Likelihood for Densities
• In the continuous case the interpretation is less straightforward. Here the likelihood function is the joint density of the observed sample as a function of the
unknown parameter and the estimate is computed by maximizing it.
• However, given that we are maximizing a joint density and not a joint probability
the simple interpretation just summarized is not directly available.
19.110
Example (Discrete Case)
Example of the two methods. Let X be distributed according to the Poisson distribution, that is: P(x; θ) = θ^x e^(−θ)/x!, x = 0, 1, 2, ... Suppose we have a simple random sample of size n.
19.111
Example Method of Moments
• For this distribution all moments exist and, for instance, E(X) = θ, E(X²) = θ² + θ.
• If we use the first moment for the estimation of θ we have θ̂₁ = x̄ but, if we choose the second moment, we have: θ̂₂ = (−1 + √(1 + 4x₂))/2, where x₂ here indicates the empirical second moment (the average of the squares).
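A small simulation makes the disagreement between the two moment estimates visible (a Python/numpy sketch; θ = 3 and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0                          # illustrative "true" value
x = rng.poisson(theta_true, size=10_000)  # simulated iid Poisson sample

theta1 = x.mean()                         # invert E(X) = theta
m2 = (x.astype(float) ** 2).mean()        # empirical second moment
theta2 = (-1 + np.sqrt(1 + 4 * m2)) / 2   # invert E(X^2) = theta^2 + theta

print(theta1, theta2)   # two different, both consistent, estimates
```

The two numbers agree only approximately: each choice of moment defines its own estimate.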
19.112
Example Maximum likelihood
• The joint probability of a given Poisson sample is: L_x(θ) = Π_i θ^(x_i) e^(−θ)/x_i! = θ^(Σ_i x_i) e^(−nθ) / Π_i x_i!.
• For a given θ this probability does not depend on the specific values of each
single observation but only on the sum of the observations and the product of
the factorials of the observations.
• The value of θ which maximizes the likelihood is θ̂_ml = x̄, which coincides with the method of moments estimate if we use the first moment as the function to invert.
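A numerical check that the likelihood maximizer is indeed the sample mean (a Python sketch; the simulated sample and the search grid are illustrative):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)
x = rng.poisson(2.5, size=500)            # illustrative Poisson sample

# Poisson log-likelihood: sum(x) log(theta) - n theta - sum(log(x_i!)).
sx, n = x.sum(), len(x)
const = sum(lgamma(xi + 1) for xi in x)   # does not depend on theta
grid = np.linspace(0.5, 5.0, 45001)
loglik = sx * np.log(grid) - n * grid - const
theta_ml = grid[np.argmax(loglik)]

assert abs(theta_ml - x.mean()) < 1e-3    # the maximizer is the sample mean
print(theta_ml)
```

Maximizing the log-likelihood rather than the likelihood itself avoids underflow and changes nothing, since the logarithm is increasing.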
19.113
More Advanced Topics
• Sampling standard deviation, confidence intervals, tests, a preliminary comment.
• The following topics are almost untouched in standard US-style undergraduate Economics curricula, and only scantly covered in other systems.
• They are, actually, very important, but only vague notions of them can be expected of a student as a prerequisite.
• In the following paragraphs such vague notions are shortly described.
19.114
Sampling Standard Deviation and Confidence Intervals
• As stated above, a point estimate is useless if it is not provided with some measure
of sampling error.
• A common procedure is to report the point estimate joint with some measure
related to sampling standard deviation.
• We say “related” because, in the vast majority of cases, the sampling standard
deviation depends on unknown parameters, hence it can only be reported in an
“estimated” version.
19.115
Sampling Variance of the Mean
• The simplest example is this.
• Suppose we have n iid observations from an unknown distribution about which
we only know that it possesses expected value µ and variance σ 2 (by the way, are
we considering here a parametric or a non parametric model?)
• In this setting we know that the arithmetic mean is an unbiased estimate of µ.
• By recourse to the usual properties of the variance operator we find that the
variance of the arithmetic mean is σ 2 /n.
• If (as is very frequently the case) σ² is unknown, even after observing the sample we cannot give the value of the sampling standard deviation.
19.116
Estimation of the Sampling Variance
• We may estimate the numerator of the sampling variance, σ² (typically using the sample variance, with n or, better, n − 1 as the denominator), and we usually report the square root of the estimated sampling variance.
• Remember: this is an estimate of the sampling standard deviation; hence, it too is affected by sampling error (in widely used statistical software, invariably, we see the definition “standard deviation of the estimate” in place of “estimated standard deviation of the estimate”: this is not due to ignorance of the software authors, just to the need for brevity, but it could be misleading for less knowledgeable software users).
19.117
nσ Rules
• In order to give a direct joint picture of the estimate and its (estimated) standard deviation, nσ “rules” are often followed by practitioners.
• They typically report “intervals” of the form Point Estimate ± n Estimated Standard Deviation. A popular value of n outside Finance is 2; in Finance we see values of up to 6.
• A way of understanding this use is as follows: if we accept the two false premises that the estimate is equal to its expected value and this is equal to the unknown parameter, and that the estimated sampling variance is the true variance of the estimate, then the Tchebicev inequality assigns a probability of at least .75 to observations of the estimate, in other similar samples, inside the “±2σ” interval (more than .97 for the “±6σ” interval).
19.118
Confidence Intervals
• A slightly more refined but much more theoretically demanding practice is that of computing “confidence intervals” for parameter estimates.
• The theory of confidence intervals typically developed in undergraduate courses
of Statistics is quite scant.
• The proper definition is usually not even given and only one or two simple examples are reported but with no precise statement of the required hypotheses.
19.119
Confidence Intervals
• These examples are usually derived in the context of simple random sampling (iid observations) from a Gaussian distribution, and confidence intervals for the unknown expected value are provided which are valid in the two cases of known and unknown variance.
• In the first case the formula is

x̄ ± z_{1−α/2} σ/√n

and in the second

x̄ ± t_{n−1,1−α/2} σ̂/√n

where z_{1−α/2} is the quantile of the standard Gaussian distribution which leaves on its left a probability of 1 − α/2 and t_{n−1,1−α/2} is the analogous quantile for the T distribution with n − 1 degrees of freedom.
19.120
Confidence Intervals
• With the exception of the more specific choice of the “sigma multiplier”, these two intervals are very similar to the “rule of thumb” intervals we introduced above.
• In fact it turns out that, if α is equal to .05, the z in the first interval is equal to 1.96 and, for n greater than, say, 30, the t in the second formula is roughly 2.
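A minimal sketch of the two intervals, using only the standard library; the toy data, the “known” σ = 0.2 and the hard-coded t quantile are our own choices (the stdlib has no Student t distribution):

```python
# Minimal sketch of the two confidence intervals, standard library only.
# Toy data; sigma = 0.2 plays the "known" standard deviation and the quantile
# t_{9, 0.975} = 2.262 is hard-coded from tables.
from statistics import NormalDist, mean, stdev
import math

sample = [1.2, 0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.4, 1.05, 0.95]
n = len(sample)
xbar = mean(sample)
alpha = 0.05

# Known variance: xbar +- z_{1-alpha/2} sigma/sqrt(n)
sigma = 0.2
z = NormalDist().inv_cdf(1 - alpha / 2)                  # approx 1.96
ci_z = (xbar - z * sigma / math.sqrt(n), xbar + z * sigma / math.sqrt(n))

# Unknown variance: xbar +- t_{n-1,1-alpha/2} s/sqrt(n)
s = stdev(sample)                                        # n - 1 denominator
t = 2.262                                                # t_{9, 0.975}
ci_t = (xbar - t * s / math.sqrt(n), xbar + t * s / math.sqrt(n))

print("z-interval:", ci_z)
print("t-interval:", ci_t)
```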
19.121
Hypothesis Testing
• The need to choose actions when their consequences are only partly known is pervasive in any human endeavor. However, few fields display this need in such a simple and clear way as finance.
• Consequently, almost the full set of normative tools of statistical decision theory has been applied to financial problems, and with considerable success, when used as normative tools (much less success, if any, was encountered by attempts to use such tools in the description of actual empirical human behavior; but this is to be expected).
19.122
Parametric Hypothesis
• Statistical hypothesis testing is a very specific and simple decision procedure. It is appropriate in some contexts, and the most important thing to learn, apart from technicalities, is the kind of context it is appropriate for.
• Statistical hypotheses. Here we consider only parametric hypotheses. Given a parametric model, a parametric hypothesis is simply the assumption that the parameter of interest θ lies in some subset Θ_i ⊆ Θ.
19.123
Two Hypotheses
• In standard hypothesis testing we confront two hypotheses of this kind (θ ∈ Θ_0, θ ∈ Θ_1) with the requirement that, wrt the parameter space, they should be exclusive (they cannot both be true at the same time) and exhaustive (they cover the full parameter space).
• So, for instance, if you are considering a Gaussian model and your two hypotheses
are that the expected value is either 1 or 2, this means, implicitly, that no other
values are allowed.
19.124
Simple and Composite
• A statistical hypothesis is called “simple” if it completely specifies the distribution of the observables; it is called “composite” if it specifies a set of possible distributions. The two hypotheses are termed the “null” hypothesis (H_0) and the “alternative” hypothesis (H_1).
• The reason for the names lies in the fact that, in the traditional setting where testing theory was developed, the “null” hypothesis corresponds to some conservative statement whose acceptance would not imply a change of behavior in the researcher, while the “alternative” hypothesis would have implied, if accepted, a change of behavior.
19.125
Example
• The simplest example is that of testing a new medicine or medical treatment.
• In a very stylized setting, let us suppose we are considering substituting an already established and reasonably well working treatment for some illness with a new one.
• This decision is to be made on the basis of the observation of some clinical parameter in a population.
• We know enough to be able to state that the observed characteristic is distributed in a given way if the new treatment is not better than the old one, and in a different way if this is not the case.
• In this example the distribution under the hypothesis that the new treatment is
not better than the old shall be taken as the null hypothesis and the other as the
alternative.
19.126
Critical Region, Acceptance Region
• The solution to a testing problem is a partition of the set of possible samples into two subsets. If the actually observed sample falls in the acceptance region, x ∈ A, we accept the null; if it falls in the rejection or critical region, x ∈ C, we reject it.
• We assume that the union of the two regions covers the full set of possible samples (the sample space) while their intersection is empty (they are exclusive). This is similar to what is required of the hypotheses wrt the parameter space, but has nothing to do with it.
• The critical region stands to testing theory in the same relation as the estimate stands to estimation theory.
19.127
Errors of First and Second Kind
• Two errors are possible:
1. x ∈ C but the true hypothesis is H0 , this is called error of the first kind;
2. x ∈ A but the true hypothesis is H1 , this is called error of the second kind.
• We would like to avoid these errors; however, obviously, we do not even know (except in toy situations) whether we are committing them, just as we do not know how wrong our point estimates are.
• Proceeding in a way similar to what we did in estimation theory, we define some measure of error.
19.128
Power Function and Size of the Errors
• Power function and size of the two errors. Given a critical region C, for each θ ∈ Θ_0 ∪ Θ_1 (which sometimes, but not always, corresponds to the full parameter space Θ) we compute Π_C(θ) = P(x ∈ C; θ), that is, the probability, as a function of θ, of observing a sample in the critical region, so that we reject H_0.
• We would like, ideally, this function to be near 1 for θ ∈ Θ_1 and near 0 for θ ∈ Θ_0.
• We define α = sup_{θ∈Θ_0} Π_C(θ), the (maximum) size of the error of the first kind, and β = sup_{θ∈Θ_1} (1 − Π_C(θ)), the (maximum) size of the error of the second kind.
19.129
Testing Strategy
• There are many reasonable requirements on the size of the two errors that we could ask the critical region to satisfy.
• The choice made in standard testing theory is somewhat strange: we set α to an arbitrary (typically small) value and we try to find the critical region that, given that (or a smaller) size of the error of the first kind, minimizes (among the possible critical regions) the error of the second kind.
• The reason for this choice is to be found in the traditional setting described above. If accepting the null means continuing some standard and reasonably successful therapy, it could be sensible to require a small probability of rejecting this hypothesis when it is true, while a possibly big error of the second kind could be considered acceptable.
19.130
Asymmetry
The reader should consider the fact that this very asymmetric setting is not the most
common in applications.
19.131
Some Tests
• One sided hypotheses for the expected value in the Gaussian setting. Suppose
we have an iid sample from a Gaussian random variable with expected value µ
and standard deviation σ.
• We want to test H_0 : µ ≤ a against H_1 : µ ≥ b, where a ≤ b are two given real numbers. It is reasonable to expect that a critical region of the shape C : {x : x̄ > k} should be a good one.
• The problem is to find k.
19.132
Some Tests
• Suppose first σ is known. The power function of this critical region is (we use the properties of the Gaussian under standardization):

Π_C(θ) = P(x ∈ C; θ) = P(x̄ > k; µ, σ) = 1 − P( (x̄ − µ)/(σ/√n) ≤ (k − µ)/(σ/√n) ) = 1 − Φ( (k − µ)/(σ/√n) )

• where Φ is the usual cumulative distribution function of the standard Gaussian distribution.
19.133
Some Tests
• Since (k − µ)/(σ/√n) is decreasing in µ, the power function is increasing in µ; hence its maximum value in the null hypothesis region is at µ = a.
• We want to set this maximum size of the error of the first kind to a given value α, so we want 1 − Φ( (k − a)/(σ/√n) ) = α, that is (k − a)/(σ/√n) = z_{1−α}, so that k = a + (σ/√n) z_{1−α}.
• When the variance is unknown the critical region has the same shape but k = a + (σ̂/√n) t_{n−1,1−α}, where σ̂ and t are as defined above.
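A sketch of the known-variance case in code (the parameter values a, σ, n, α are our own illustrative choices):

```python
# Sketch of the one-sided test: critical value k = a + (sigma/sqrt(n)) z_{1-alpha}
# and power function Pi_C(mu) = 1 - Phi((k - mu)/(sigma/sqrt(n))).
from statistics import NormalDist
import math

a, sigma, n, alpha = 0.0, 1.0, 25, 0.05
se = sigma / math.sqrt(n)
z = NormalDist().inv_cdf(1 - alpha)      # z_{1-alpha}, approx 1.645
k = a + se * z                           # reject H0 when the sample mean exceeds k

def power(mu):
    return 1 - NormalDist().cdf((k - mu) / se)

print(f"k = {k:.4f}")
print(f"size at mu = a: {power(a):.4f}")         # equals alpha by construction
print(f"power at mu = 0.5: {power(0.5):.4f}")
```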
19.134
Some Tests
The reader should solve the same problem when the hypotheses are reversed and compare the solutions.
19.135
Some Tests
• Two sided hypotheses for the expected value in the Gaussian setting and confidence intervals.
• By construction the confidence interval for µ (with known variance)

x̄ ± z_{1−α/2} σ/√n

contains µ with probability (independent of µ) equal to 1 − α.
• Suppose we have H_0 : µ = µ_0 and H_1 : µ ≠ µ_0 for some given µ_0. The above recalled property of the confidence interval implies that the probability with which

x̄ ± z_{1−α/2} σ/√n

contains µ_0, when H_0 is true, is 1 − α.
19.136
Some Tests
√ • The
critical
region:
C
:
x
:
µ
∈
/
x
±
z
σ/
n or, that is the same: C :
0
1−α/2
√ x:x∈
/ µ0 ± z1−α/2 σ/ n has only α probability of rejecting H0 when H0 is
true.
• Build the analogous region in the case of unknown variance and consider the
setting where you swap the hypotheses.
20
20.1
*Taylor formula in finance (not for the exam)
*Taylor’s theorem.
Let k ≥ 1 be an integer and let the function f : ℝ → ℝ be k times differentiable at the point a ∈ ℝ. Then there exists a function h_k : ℝ → ℝ such that

f(x) = f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)² + ⋯ + (f^(k)(a)/k!)(x − a)^k + h_k(x)(x − a)^k

and lim_{x→a} h_k(x) = 0.
The last term in the formula is called the Peano form of the remainder.
The Peano remainder only tells us that, if we define P_k(x) = f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)² + ⋯ + (f^(k)(a)/k!)(x − a)^k, the Taylor polynomial of the function f at the point a, and R_k(x) = f(x) − P_k(x), the remainder term, then R_k(x) = o(|x − a|^k) as x → a, that is, lim_{x→a} R_k(x)/|x − a|^k = 0.
Notice that there is no reason to call P_k(x) the “Asymptotic Best Fit” of a polynomial to f(x) over any interval centered at a: in fact, no specific interval and no error measure to be minimized were defined in order to define the Taylor polynomial. There are several ways to fit polynomials to functions over intervals which, in many senses, give a “better fit” than the Taylor formula. The precise meaning of the Taylor formula shall become clearer if you study the proof sketched below.
Moreover, the result of Taylor’s theorem only tells us something about the speed
with which the remainder goes to 0 when we consider x → a. In fact the Peano form of
the remainder tells us nothing about the size of the remainder for a given x not equal
to a.
20.2
*Remainder term
In order to make more precise statements about the size of the remainder term over
an interval of interest we need stronger hypotheses.
Two famous results are as follows.
Lagrange form:
Let f : ℝ → ℝ be k + 1 times differentiable on the open interval and continuous on the closed interval between a and x. Then

R_k(x) = (f^(k+1)(ξ_L)/(k + 1)!)(x − a)^(k+1)   for some real number ξ_L between a and x.

Cauchy form:

R_k(x) = (f^(k+1)(ξ_C)/k!)(x − ξ_C)^k (x − a)   for some real number ξ_C between a and x.
Notice that the error term is for a given x: ξ_L shall change if you change x. However, the result is still very useful as it allows for bounding the remainder over any given interval by maximizing it over the interval (see the example below).
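For instance, a quick numeric sketch (our own example, not from the text): bounding the remainder of the degree-2 Taylor polynomial of eˣ at a = 0 by maximizing the Lagrange form over [0, x].

```python
# Our own example: the Lagrange form gives, for f = exp expanded at a = 0,
# |R_2(x)| <= max_{xi in [0,x]} e^xi / 3! * x^3 = e^x x^3 / 6.
import math

x, k = 0.5, 2
P2 = 1 + x + x**2 / 2                                     # Taylor polynomial of e^x at 0
actual_remainder = math.exp(x) - P2
bound = math.exp(x) * x**(k + 1) / math.factorial(k + 1)  # maximized Lagrange bound
print(f"remainder {actual_remainder:.6f} <= bound {bound:.6f}")
```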
20.3
*Proof
There is a very nice proof of Taylor theorem plus Lagrange remainder, which is only
algebraic (that is: it does not require limit statements and is only based on algebraic
operations) once you suppose the required derivatives exist.
In fact, this proof is a streamlined version of Lagrange’s own proof.
Suppose you can write a function f(x + h) as

f(x + h) = a_0 + a_1 h + a_2 h² + ... + a_n h^n + a_h h^(n+1)

If f is bounded in an interval around x, it is obviously always possible to get the equality, since a_h depends on h.
Now, fix some h* and the corresponding a_{h*}. The function

f(x + h) − (a_0 + a_1 h + a_2 h² + ... + a_n h^n + a_{h*} h^(n+1))

shall be equal to 0 if h = h* and, if we set a_0 = f(x), it shall be equal to 0 also when h = 0. However, if a function is equal to 0 in two points and is differentiable in between (a hypothesis we make), then the first derivative of the function must be 0 at some point, say h_0, between 0 and h* (this is Rolle’s theorem, by Michel Rolle, France, 1652-1719, a contemporary of King Louis XIV, and it is a less obvious and much more powerful result than it seems).
The first derivative of the function is

f′(x + h) − (a_1 + 2a_2 h + ... + n a_n h^(n−1) + (n + 1) a_{h*} h^n)

and if we set a_1 = f′(x) = f^(1)(x) this function is equal to 0 both at h = 0 and h = h_0.
We can then repeat the argument: there must exist some point h_1 between 0 and h_0 where the derivative of this function is 0, and if we set 2a_2 = f″(x) = f^(2)(x) the equality to 0 of the derivative holds also for h = 0. So there must exist a point between 0 and h_1 where the derivative of this derivative is zero... and so on.
After repeating this n times we get

f^(n)(x + h) − (n! a_n + (n + 1)! a_{h*} h)

There must be some point h_n between 0 and h_{n−1} where this is 0 and, if we set a_n = f^(n)(x)/n!, the function shall be equal to 0 also at h = 0.
We take one more derivative and get

f^(n+1)(x + h) − (n + 1)! a_{h*}

which must be zero for some h_{n+1} between 0 and h_n, so that

a_{h*} = f^(n+1)(x + h_{n+1}) / (n + 1)!
In the end, we have found that, for any value of h, there exists an H_h (the h_{n+1} of the proof) between 0 and h such that

f(x + h) = f(x) + f^(1)(x) h + ... + (f^(n)(x)/n!) h^n + (f^(n+1)(x + H_h)/(n + 1)!) h^(n+1)

where the notation should help to stress the dependence of the “coefficient” of the last term, f^(n+1)(x + H_h)/(n + 1)!, on the specific point h (one can avoid the dependence if and only if the function to be approximated is itself a polynomial of degree n + 1).
20.4
*Taylor formula and Taylor series
When the function f is infinitely differentiable around a we can legitimately consider P_k(x) for k → ∞, the Taylor series generated by f. There is, however, no reason why this series should in general converge and, if it converges, no reason why the convergence should be to f. In fact, when these two properties hold, we say that f is a member of a very important class of functions: analytic functions. It is quite possible that the Taylor series generated by a given function (necessarily not analytic) is identical to the Taylor series of another function; if this second function is analytic, then the series shall converge to it.
A standard example where this happens is the function f(x) = e^(−1/x²) (with f(0) = 0). This is infinitely differentiable at x = 0. If we compute any f^(k)(0), this is always equal to 0, so that the Taylor polynomial is always zero on any interval; the remainder is, obviously, the function itself, which is o(|x − a|^k) for any k ≥ 0 because it goes to 0 for x → 0 faster than any power. As we see, Taylor’s theorem is satisfied for any k, and the Taylor series converges, but it converges to the null function. The null function is, obviously, analytic, and its Taylor polynomials (not the remainders!) are all identical to those of f(x). Clearly the two functions are different if x ≠ 0.
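A small numeric sketch of this flatness (the evaluation point is our own choice): near 0 the function is smaller than any power of x.

```python
# The flat function f(x) = exp(-1/x^2), with f(0) = 0: f(x)/x^k -> 0 as x -> 0
# for every k, which is why all the Taylor coefficients at 0 vanish.
import math

def f(x):
    return 0.0 if x == 0 else math.exp(-1.0 / x**2)

x = 0.1
for k in (1, 5, 10):
    print(f"k = {k:2d}: f(x)/x^k = {f(x) / x**k:.3e}")   # tiny for every k
```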
20.5
*Taylor formula for functions of several variables
The general notation is quite cumbersome.
Let |α| = α_1 + ⋯ + α_n, α! = α_1! ⋯ α_n!, x^α = x_1^{α_1} ⋯ x_n^{α_n} for α ∈ ℕ^n and x ∈ ℝ^n. If all the k-th order partial derivatives of f : ℝ^n → ℝ are continuous at a ∈ ℝ^n, then one can change the order of mixed derivatives at a, so the notation

D^α f = ∂^{|α|} f / (∂x_1^{α_1} ⋯ ∂x_n^{α_n}),   |α| ≤ k

for the higher order partial derivatives is not ambiguous. The same is true if all the (k − 1)-th order partial derivatives of f exist in some neighborhood of a and are differentiable at a. Then we say that f is k times differentiable at the point a.
Multivariate version of Taylor’s theorem.
Let f : ℝ^n → ℝ be a k times differentiable function at the point a ∈ ℝ^n. Then there exist functions h_α : ℝ^n → ℝ such that

f(x) = Σ_{|α|≤k} (D^α f(a)/α!) (x − a)^α + Σ_{|α|=k} h_α(x) (x − a)^α

and lim_{x→a} h_α(x) = 0.
The remainder term can be written in the Lagrange form as

Σ_{|α|=k+1} (D^α f(ξ_L)/α!) (x − a)^α

where ξ_L = a + (x − a)c_L and c_L is a scalar with c_L ∈ (0, 1).
In words: ξ_L is a point reached starting from a in the direction of x, at a fraction c_L of the distance between a and x. This is the multidimensional analogue of the sentence “for some real number ξ_L between a and x” used in defining remainders in the one dimensional case.
If you suppose n = 2 and k = 2, the Taylor formula amounts to:

f(x_1, x_2) = f(a_1, a_2) + f_{x_1}(a_1, a_2)(x_1 − a_1) + f_{x_2}(a_1, a_2)(x_2 − a_2) +
+ f_{x_1 x_2}(a_1, a_2)(x_1 − a_1)(x_2 − a_2) + (1/2) f_{x_1 x_1}(a_1, a_2)(x_1 − a_1)² + (1/2) f_{x_2 x_2}(a_1, a_2)(x_2 − a_2)² +
+ h_{1,1}(x_1, x_2)(x_1 − a_1)(x_2 − a_2) + h_{2,0}(x_1, x_2)(x_1 − a_1)² + h_{0,2}(x_1, x_2)(x_2 − a_2)²
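A sketch of this bivariate second-order expansion in code; the function f(x₁, x₂) = e^{x₁} sin x₂ and the expansion point (0, 0) are our own illustrative choices, with the partial derivatives computed by hand:

```python
# Sketch of the degree-2 bivariate expansion of f(x1, x2) = e^{x1} sin(x2)
# around (0, 0) (function and point are our own illustrative choices).
# Hand-computed partials at (0, 0): f = 0, f_x1 = 0, f_x2 = 1,
# f_x1x1 = 0, f_x1x2 = 1, f_x2x2 = 0.
import math

def f(x1, x2):
    return math.exp(x1) * math.sin(x2)

def taylor2(x1, x2):
    return x2 + x1 * x2          # only the f_x2 and f_x1x2 terms survive

h = 0.1
err = abs(f(h, h) - taylor2(h, h))
print(f"f = {f(h, h):.6f}, degree-2 Taylor = {taylor2(h, h):.6f}, error = {err:.2e}")
```

The error behaves as the third power of the displacement, as the Peano remainder predicts for k = 2.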
20.6
*Simple examples of Taylor formula and Taylor theorem
in quantitative Economics and Finance
We can do beautiful and easy things with polynomials which we cannot do so easily
with other functions.
For instance, it is easy to differentiate or integrate polynomials, and the results of these operations are again polynomials.
It is easy to find expected values of polynomials, as these only involve moments.
Sums, differences and products of polynomials are still polynomials71.

71 The “taking the ratio” operation requires some more precision: polynomials are a special case of a larger class of functions called “rational functions”. Rational functions are functions which can be written as ratios of polynomials (if the denominator is 1 we are back to polynomials). Rational functions are closed under the operations listed in the text and under ratios. Moreover their integrals and derivatives are easy to compute.
Finally, two polynomials are equal if and only if they have identical coefficients for terms with the same power.
There is also a famous result, the Stone-Weierstrass theorem, that tells us this: for any continuous function f(x) on a closed interval [a, b] and any number ε > 0 there exists a polynomial g_{f,ε}(x) such that sup_{x∈[a,b]} |f(x) − g_{f,ε}(x)| < ε. That is: we can approximate the given function with a polynomial over an interval with maximum error ε. We can actually build such a polynomial (if we know how to compute f(x)) using an interesting tool, borderline between Probability and Analysis, called “Bernstein polynomials”.
This result implies that polynomials are, in a precise (and to be understood precisely!) sense, all we need if we are modeling phenomena using continuous functions on bounded intervals.
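A sketch of the Bernstein construction just mentioned, B_n(f)(x) = Σ_k f(k/n) C(n,k) x^k (1−x)^{n−k} on [0, 1]; the test function (sin) and the degree (n = 200) are our own choices:

```python
# Sketch of the Bernstein polynomial of a continuous function f on [0, 1]:
# B_n(f)(x) = sum_{k=0}^{n} f(k/n) C(n, k) x^k (1-x)^(n-k).
import math

def bernstein(f, n, x):
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

n = 200
grid = [i / 100 for i in range(101)]
max_err = max(abs(math.sin(x) - bernstein(math.sin, n, x)) for x in grid)
print(f"max |sin - B_{n}(sin)| on a grid: {max_err:.5f}")
```

Convergence is slow (the error is roughly f″(x)x(1−x)/(2n)), which is the price paid for the construction working for every continuous function.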
It is then tempting, when we do algebra with functions, that is, when we try to figure out the implications of our mathematical models, to try and approximate the functions of interest with polynomials before acting on them.
The problem here is that, even if the initial approximation is good, it may well be that even simple operations on the functions amplify the size of the error.
For instance, the second order approximations near x = 0 of the polynomials x + x² + .00001x³ and x + x² are identical: both are x + x². For the second polynomial this is perfect, and for the first it is very good, if x is not too far from 0. However, if we take even the simple difference of the exact functions and of their approximations, we find two different results (.00001x³ and 0). Not much, but suppose this difference, further on in our modeling, is multiplied by a big number, or maybe is part of a formula where you are interested in some limit for x going to infinity.
The possible problems are clear.
In the following examples we shall consider some simple financial applications of polynomial approximations based on the Taylor formula. Many other polynomial based approximation methods exist (we mentioned Bernstein polynomials; another very important approximation toolbox in applied Mathematics is based on orthogonal polynomials).
The Reader should be warned that this is a very tricky job: approximations may work for some purpose and dramatically fail for others. Conditions and rules for a cautious use of the trick exist, and a detailed study of these is required before an independent foray into the use of polynomial approximations in general and of Taylor approximations in particular.
20.7
*Linear and log returns, a reconsideration
As discussed at the beginning of these notes, in finance the evolution of security values, for instance in the case of stock prices, is often modeled in terms of returns, not of prices. We do not directly model the evolution of prices over time (which is, obviously, our main interest) but the return process, and from this, if necessary, we derive price behavior.
The reason for this very peculiar attitude is partially similar to the reason behind
the fact that physical models of motion are usually written in terms of acceleration
and not of speed or position: the Mathematics is simpler.
In simple physical models accelerations are proportional to forces (Newton’s second law, the proportion being given by the reciprocal of mass) and forces “add” in a simple way; in finance returns, properly defined, still “add in a simple way” (and the analogy ends here). Moreover, as a first approximation (here not in the Taylor sense!) it is an empirical fact that returns (again, properly defined) can be considered as statistically identically distributed and independent over time, while this is not true for price levels and price differences. For price levels this is obvious; for price differences it is another empirical fact that, as a rule, the variance depends on the level of the price. It is much easier to start modeling independent random variables and then derive models for variables that are functions of these, hence the choice of returns.
The problem is what the “right” definition of return is. Here we have to optimize a trade off between the “natural and intuitive” definition of return, which existed well before quantitative finance, and a definition of return easier to work with in mathematical terms (the “properly defined” proviso above).
When we wish to define return we must first consider whether to take into account only price evolution or to also consider, for instance, dividends and similar actions on the firm’s capital for a stock, or coupons for a bond (here we shall stick to the stock case as an example).
Secondly, we need to choose a particular formula for the return definition.
Suppose we are interested in a time period between t and t + 1. At time t the price of a share of stock is P_t and at time t + 1 it is P_{t+1}. Moreover, between the two dates some dividend was distributed; let us call this D_t (we are supposing that the dividend is known at t but is distributed at t + 1 with no accrual, so that P_t is the “cum dividend” price).
A very simple definition of “percentage price return” is r_{t+1} = P_{t+1}/P_t − 1, while a very simple definition of “percentage total return” is R_{t+1} = (P_{t+1} + D_t)/P_t − 1. It is clear that, from the point of view of the “financial meaning” of the result, while it is often the first formula that is used in financial newspapers, the second one is the more appropriate to express the “percentage gain” of holding the share between t and t + 1.
This is a very simple definition with a very simple interpretation in terms of percent
gain. It is not always a very useful definition if we want to model returns.
The problem is that simple percentage (sometimes called “linear”) returns (price only or total) do not add over time.
In fact, with price returns

r_{t+2,t} = P_{t+2}/P_t − 1 = (P_{t+2}/P_{t+1})(P_{t+1}/P_t) − 1 = (1 + r_{t+2})(1 + r_{t+1}) − 1

or, better, in terms of “gross returns”, 1 + r_{t+2,t} = (1 + r_{t+2})(1 + r_{t+1}), that is, P_{t+2}/P_t = (P_{t+2}/P_{t+1})(P_{t+1}/P_t).
With total returns the mess may be even worse, depending on how we deal with per-period dividends in defining multi period returns: do we simply add them or consider accruals?
According to the first possibility we get

1 + R_{t+2,t} = (P_{t+2} + D_{t+1} + D_t)/P_t = (P_{t+2} + D_{t+1})/P_t + D_t/P_t =
= (1 + R_{t+2})(1 + R_{t+1}) − (1 + R_{t+2}) D_t/P_t + D_t/P_t = (1 + R_{t+2})(1 + R_{t+1}) − R_{t+2} D_t/P_t

If, instead, we define 1 + R_{t+2,t} = (P_{t+2} + D_{t+1} + D_t(1 + R_{t+2}))/P_t, we get 1 + R_{t+2,t} = (1 + R_{t+2})(1 + R_{t+1}).
A possible defense, beyond simple opportunity, of this “with accruals” definition is that the difference with the no accrual definition is just given by the term R_{t+2} D_t/P_t, which is likely to be small. Beware! This is going to be true only for short time spans.
The mathematical solution for the additivity problem, when considering gross price return or total return with accrual, is immediate: take the (natural) logarithm, so that r*_{t+2,t} = ln(1 + r_{t+2,t}) = r*_{t+2} + r*_{t+1} and, analogously, R*_{t+2,t} = ln(1 + R_{t+2,t}) = R*_{t+2} + R*_{t+1}.
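A toy numeric check (the prices are made up) that gross linear returns multiply over periods while log returns add:

```python
# Toy prices: gross linear returns multiply over periods, log returns add.
import math

P = [100.0, 103.0, 99.5]                 # P_t, P_{t+1}, P_{t+2}

r1, r2 = P[1] / P[0] - 1, P[2] / P[1] - 1
r_two = P[2] / P[0] - 1
print("gross-return gap:", abs((1 + r_two) - (1 + r1) * (1 + r2)))  # ~0: they multiply
print("linear-sum gap:", abs(r_two - (r1 + r2)))                    # not 0: they do not add

s1, s2 = math.log(P[1] / P[0]), math.log(P[2] / P[1])
s_two = math.log(P[2] / P[0])
print("log-sum gap:", abs(s_two - (s1 + s2)))                       # ~0: log returns add
```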
What happens with total returns? The problem is in the definition of total return over many time periods. If we define total return over one time period as we did:

R*_{t+1} = ln( (P_{t+1} + D_t)/P_t )

for total return over two time periods we could define:

R*_{t;t+2} = ln( (P_{t+2} + D_{t+1} + D_t)/P_t ) ≠ ln( (P_{t+2} + D_{t+1})/P_{t+1} ) + ln( (P_{t+1} + D_t)/P_t )

so that no simple aggregation holds.
However, it is clear that in the previous definition we do not consider the dividend as reinvested. Let us suppose that dividend D_t is paid at time t + 1 (and the same for the other dividends). Between time t + 1 and t + 2 this dividend is reinvested in the same stock, so that at time t + 2 its value is D_t(P_{t+2} + D_{t+1})/P_{t+1}. Keeping this in mind we define

R*_{t;t+2} = ln( (P_{t+2} + D_{t+1} + D_t(P_{t+2} + D_{t+1})/P_{t+1}) / P_t ) = ln( (P_{t+2} + D_{t+1})(P_{t+1} + D_t) / (P_t P_{t+1}) ) =
= ln( (P_{t+2} + D_{t+1})/P_{t+1} ) + ln( (P_{t+1} + D_t)/P_t )
hence, if we take into account dividend reinvestment according to this convention (others exist, but only this one gives the required result), we have that time additivity holds not only for simple (or price) log returns but also for total log returns. It is to be noticed that this formula requires the knowledge of P_{t+1} to compute the two period return, while the no compounding formula does not. This means that we can compute time additive total log returns, but in order to do so we need intermediate prices or, at least, capitalized dividends. In practice, multi period total returns can only be computed by adding single period returns while, in the simple log return case, it is possible to compute many period log returns by simply knowing the initial and terminal price.
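A toy check (all prices and dividends made up) that the reinvestment convention restores additivity for total log returns, while simply adding dividends does not:

```python
# Toy check of the reinvestment convention: total log returns add over periods
# only if D_t is reinvested in the stock at t + 1 (all numbers made up).
import math

Pt, Pt1, Pt2 = 100.0, 104.0, 101.0       # P_t, P_{t+1}, P_{t+2}
Dt, Dt1 = 2.0, 2.5                       # D_t paid at t+1, D_{t+1} paid at t+2

R1 = math.log((Pt1 + Dt) / Pt)           # one-period total log returns
R2 = math.log((Pt2 + Dt1) / Pt1)

# D_t reinvested: its value at t+2 is D_t (P_{t+2} + D_{t+1}) / P_{t+1}
R_two = math.log((Pt2 + Dt1 + Dt * (Pt2 + Dt1) / Pt1) / Pt)
# Naive convention: simply add the dividends
R_two_naive = math.log((Pt2 + Dt1 + Dt) / Pt)

print("with reinvestment, gap:", abs(R_two - (R1 + R2)))   # ~0
print("naive, gap:", abs(R_two_naive - (R1 + R2)))         # not 0
```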
20.8
*Taylor theorem and the connection between linear and
log returns
Now Taylor’s theorem. Can it help us in assessing how wrong we are if we take log returns in the place of linear returns, and if we suppose that log total returns are the sum of log price returns and the log dividend price ratio?
Let us start with log price returns.
r*_{t+1} = ln(P_{t+1}/P_t): if we suppose that the ratio of the two prices is near 1 (that is, if we allow for a short time span between the two prices and suppose no significant new information arrives in the interval), we can approximate the log return around x = 1, where x = P_{t+1}/P_t.
It is obvious that in any open interval including x = 1 and not including x = 0, ln x is differentiable any number of times, so that we can define its Taylor series as:

ln x = 0 + (x − 1) − (x − 1)²/2 + (x − 1)³/3 − ...

If we truncate this expansion to the first order we get ln x ≈ x − 1, so that r*_{t+1} ≈ r_{t+1}.
What happens for total returns? It is clear that the above argument still holds, simply taking x = (P_{t+1} + D_t)/P_t.
In both cases, as we saw in the first chapters of these handouts, since x − 1 is tangent to ln x at x = 1 (as the two functions have the same value 0 and the same derivative 1 at that point) and the second derivative of ln x is always negative, we have that x − 1 ≥ ln x, so that the linear return shall never be smaller than the log return.
20.9
*How big is the error?
This depends on how far x is from 1, that is, how far the price, or price plus dividend, moved from the past price.
According to the Lagrange remainder formula we have R_k(x) = (f^(k+1)(ξ_L)/(k + 1)!)(x − a)^(k+1) which, in our case, becomes R_1(x) = −(1/(2ξ_L²))(x − 1)².
We see at once that the error is always negative, that is, the log return cannot be bigger than the linear return. Moreover, it is always going to behave (locally) as a parabola and, for x in any interval with 0 < α < β, it shall be bounded by −(1/(2 min(α², 1)))(x − 1)² (obviously we can compute it to any precision for any value of x, but the result is still useful for a clear understanding of the approximation properties).
To have an idea: for a change of ±10% in price, the bound just given assesses that the error shall be no smaller than −(1/(2 · 0.9²))(0.1)² = −.01/1.62 = −.006172. The actual error for a 10% change in price is −.00536 when the change is downward. For a 10% increase in price the error is −.00469.
In the end, is the error negligible?
It obviously depends on the likelihood of big changes in price (price plus dividend)
in the time interval under consideration and on the precision we want to achieve with
our model.
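The numbers in this section can be reproduced with a few lines (a sketch; the interval endpoint α = 0.9 matches the ±10% example above):

```python
# Error of the approximation ln x ~ x - 1 for a +-10% move, against the
# Lagrange bound (x - 1)^2 / (2 min(alpha, 1)^2) in absolute value.
import math

def error(x):
    return math.log(x) - (x - 1)         # always <= 0, since x - 1 >= ln x

def bound_size(x, a):
    return (x - 1)**2 / (2 * min(a, 1)**2)

for x in (0.9, 1.1):
    print(f"x = {x}: error = {error(x):.5f}, size bound = {bound_size(x, 0.9):.6f}")
```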
20.10
*Gordon model and Campbell-Shiller approximation.
Let us now consider a related important topic.
If we do simple algebra we get

R_{t+1} = (P_{t+1} − P_t + D_t)/P_t = ΔP_{t+1}/P_t + D_t/P_t = r_{t+1} + D_t/P_t

that is: the price return plus the dividend to price ratio. In other words, D_t/P_t = R_{t+1} − r_{t+1}. This, ex post, is an identity, and we can take the expected value conditional on t and get D_t/P_t = E_t(R_{t+1} − r_{t+1}).
Still an identity, not a model. That is: these relations are always true and put no constraint on the observable variables. In fact this identity is a little bit of a cheat as, conditional on t, the identity D_t/P_t = R_{t+1} − r_{t+1} implies that R_{t+1} − r_{t+1} is non stochastic since, obviously, it does not contain the only stochastic (given t) element in the t + 1 returns, that is, P_{t+1}.
However, if we do not condition on t, this becomes a relevant identity because it says
that, whatever the model, the dividend price ratio at time t, now a random variable
as we are not conditioning on t any more, is identical to the (now random) conditional
on t expected value of future excess returns, so that any model of future excess returns
expectations is a model of the dividend price ratio. Any change in expectations for
future excess returns must imply a change in dividend price ratio.
Let’s create a model.
The simplest example is the Gordon model.
This is a very old idea which still is the groundwork for much “self understanding”
on the part of companies and investors in the market.
The starting point is the idea that the market price of a company (and to be simple
suppose that this is equal to the price of its stock) must be the same as the flow of
280
its future dividends discounted using as discount rate the cost of capital r (supposed
constant) for the company. In order to get a simple formula we shall suppose that the
cost of capital is constant and that future dividends are equal to current dividend plus
P
(1+g)i−1
a constant percentage dividend growth g. We have then Pt = ∞
i=1 Dt (1+r)i (here we
implicitly suppose that the first dividend Dt is paid at t + 1).
P∞ i
θ
From this we get (using the standard result for power series:
i=1 θ = 1−θ ; 0 ≤
θ < 1, and supposing g < r)
Dt
=r−g
Pt
Recalling now our previous result, this implies that the excess expected return is
roughly a constant Et (Rt+1 − rt+1 ) = r − g given by the difference between the cost
of capital and the rate of growth of dividends. Constant because in the Gordon model
we suppose these two parameters to be constant.
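A numeric sketch of the formula (the parameter values are illustrative): the truncated dividend-discount sum converges to D_t/(r − g).

```python
# Numeric sketch of the Gordon formula: the truncated sum of discounted
# growing dividends approaches D_t / (r - g) when g < r.
D, r, g = 2.0, 0.08, 0.03

price_closed = D / (r - g)               # Gordon: D_t / P_t = r - g
price_sum = sum(D * (1 + g)**(i - 1) / (1 + r)**i for i in range(1, 2001))

print(f"closed form: {price_closed:.4f}, truncated sum: {price_sum:.4f}")
```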
The result or, more properly, the hypothesis, is surprisingly powerful, if taken seriously. For instance, suppose you have a hypothesis about dividends and know the price of the company: the formula gives you a way of computing the cost of capital compatible with the current price and the hypothesis on dividends, so that, if you (or the company) wish to enter some investment at some cost, you may think it reasonable to compare the dividends you expect from the investment with its cost, in order to guess whether this shall increase or decrease the price (value) of the company.
A second implication is that, in the absence of structural change, the dividend price ratio should be almost constant, so that if you can forecast dividends you can also forecast prices. While this is a simple rewording of the hypothesis of the model, this rather obvious conclusion (we repeat: to be honest, it is actually a hypothesis) is the implicit basis of much Corporate Finance “folk” reasoning.
Another simple rewording of the hypothesis is the common notion that dividend growth cannot be bigger (at least for an indefinite amount of time) than the cost of capital. Formally, this is nothing but the condition for convergence of the power series, but it has an interesting interpretation when we think of "bubbles" which, in this simple interpretation, could simply come from the idea that for some "new kind of company" in some "new kind of world" dividend growth shall always be bigger than the cost of capital.
This is a corporate equivalent to changing lead into gold. In the related academic
literature the condition for the convergence of the power series takes, not surprisingly,
the name of “condition for the absence of rational bubbles”.
But beware! In Economics sometimes gold becomes lead and lead gold, and this with no need of atomic reactions. Time and history suffice. Sometimes gold becomes lead: think, for instance, of the history of the economic value of salt. Today you put it in dish washing machines; not so long ago wars were fought over it. Sometimes lead becomes gold. Think about oil, for instance. It used to be a big nuisance as it fouled otherwise useful farming fields; now things are quite different. Countries like the USA are purposely destroying water reserves to recover it from shales.
In fact, each time markets are in a bubble, examples like these are quoted both by the party of reason and by the party of exuberance.
Now let us move a step further. The Gordon model is written in terms of linear returns. What happens if we start with log total returns?
The first possible line of attack is to say that, since R_{t+1} = r_{t+1} + D_t/P_t, something like

R*_{t+1} ≈ r*_{t+1} + ln(D_t/P_t)

should be true.
Here things get tricky. This formula cannot work, even approximately. At a glance we see that, while a value near one for the ratio of consecutive prices is reasonable, so that something like r_{t+1} ≈ r*_{t+1} may be true, x = D_t/P_t and ln x are always very different. We are sure that our x cannot be negative, but it shall be much smaller than 1 (except in very exceptional cases), so that its logarithm shall always be negative. We must be careful.
In a widely quoted paper by Campbell and Shiller (“The dividend price ratio and
expectations of future dividends and discount factors”, Review of Financial Studies,
1988) we find the following result
ln((P_{t+1} + D_t)/P_t) = R*_{t+1} ≈ k − ρ ln(D_t/P_{t+1}) + ln(D_{t−1}/P_t) + ln(D_t/D_{t−1}) = k − ρδ_{t+1} + δ_t + Δd_t
Notice that, while different from the above naive approximation, also in this case we have logarithms of ratios which could be near or equal to zero. The Authors do not give a rigorous proof of this approximation and state that, in order to get it, one must use Taylor formula to the first order for the total log return written as a function of δ_{t+1}, δ_t and Δd_{t+1}, where the expansion is given at the point δ_{t+1} = δ_t = δ and Δd_{t+1} = g (a constant).
Let us try and justify such an argument: the heuristic derivation given by the
authors in the paper does not allow for any proof of the quality of the approximation
and, in fact, the Authors only present a numerical study of the approximation itself.
In fact it is not difficult to provide a rigorous derivation.
First let us write the log total return as a function of the required variables.
ln((P_{t+1} + D_t)/P_t) = ln(P_{t+1} + D_t) − ln P_t = ln(P_{t+1} + D_t) − p_t =
= ln(P_{t+1}(1 + D_t/P_{t+1})) − p_t = ln(1 + D_t/P_{t+1}) + p_{t+1} − p_t =
= ln(1 + e^{δ_{t+1}}) + p_{t+1} − p_t
Notice that, simply by writing δ_{t+1} = ln(D_t/P_{t+1}), we suppose D_t > 0; this is important for what follows. Now let us expand ln(1 + e^{δ_{t+1}}) using Taylor formula at the first order at the point δ_{t+1} = δ.
ln(1 + e^{δ_{t+1}}) ≈ ln(1 + e^δ) + (e^δ/(1 + e^δ))(δ_{t+1} − δ)
so that

R*_{t+1} ≈ ln(1 + e^δ) + (e^δ/(1 + e^δ))(δ_{t+1} − δ) + p_{t+1} − p_t =
= ln(1 + e^δ) + (e^δ/(1 + e^δ))(δ_{t+1} − δ) + p_{t+1} + d_t − d_t + d_{t−1} − d_{t−1} − p_t =
= ln(1 + e^δ) − (e^δ/(1 + e^δ))δ + (e^δ/(1 + e^δ))δ_{t+1} − δ_{t+1} + δ_t + d_t − d_{t−1} =
= ln(1 + e^δ) − (e^δ/(1 + e^δ))δ − (1 − e^δ/(1 + e^δ))δ_{t+1} + δ_t + d_t − d_{t−1}

which is of the required form with k = ln(1 + e^δ) − (e^δ/(1 + e^δ))δ and ρ = 1/(1 + e^δ). Remember that the approximation only involves the ln(1 + e^{δ_{t+1}}) term; the rest of the formula is exact.
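The derivation can be checked numerically. This is a sketch with made-up prices and dividends; the expansion point ln(0.04), that is a dividend price ratio of about 4%, is a hypothetical choice:

```python
import math

# A numerical sketch with made-up data: compare the exact one-period log total
# return with the Campbell-Shiller linearization
#   R*_{t+1} ≈ k - rho*delta_{t+1} + delta_t + Delta d_t
# expanded around a hypothetical point delta = ln(0.04).
P_t, P_t1 = 100.0, 104.0       # prices at t and t+1
D_tm1, D_t = 3.8, 4.0          # dividends paid at t and t+1

delta_t = math.log(D_tm1 / P_t)     # log dividend-price ratio dated t
delta_t1 = math.log(D_t / P_t1)
dd_t = math.log(D_t / D_tm1)        # dividend log growth

delta = math.log(0.04)              # expansion point (hypothetical mean of delta_t)
rho = 1.0 / (1.0 + math.exp(delta))
k = math.log(1.0 + math.exp(delta)) - (1.0 - rho) * delta

exact = math.log((P_t1 + D_t) / P_t)
approx = k - rho * delta_t1 + delta_t + dd_t
print(exact - approx)   # small and positive: the approximation underestimates
```
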
Recalling the hope of writing something like R*_{t+1} ≈ r*_{t+1} + ln(D_t/P_t), we can slightly modify the argument in the following way:

R*_{t+1} = ln((P_{t+1} + D_t)/P_t) = ln(P_{t+1}/P_t + D_t/P_t) = ln(e^{r*_{t+1}} + e^{ln(D_t/P_t)})

If we now expand this around r*_{t+1} = 0 and ln(D_t/P_t) = δ we get
R*_{t+1} ≈ ln(1 + e^δ) + (1/(1 + e^δ)) r*_{t+1} + (e^δ/(1 + e^δ))(ln(D_t/P_t) − δ)

which is of the form

R*_{t+1} ≈ a + b r*_{t+1} + c ln(D_t/P_t)
This is as near as we can go to the naive and wrong idea R*_{t+1} ≈ r*_{t+1} + ln(D_t/P_t) and, yes, while we can expect a very near 0 and b very near 1, we can also expect c very near 0 (why all this?) so that the naive idea is VERY wrong. Notice that in the very unlikely case when dividends are expected to be HUGE with respect to P_t, that is, similar in value to P_t, we have ln(D_t/P_t) ≈ 0, but in this case a = ln 2 ≈ .693 and b = c = 1/2.
Notice also that both this approximation and the original Campbell and Shiller approximation depend on the hypothesis of non null dividends. In fact they break down if dividends are (and this is quite a possibility) very near 0 (in the case of 0 dividends the approximations are not even defined). For instance: in this case we should have R*_{t+1} = r*_{t+1}, while the approximation R*_{t+1} ≈ a + b r*_{t+1} + c ln(D_t/P_t) is going to give us a
very negative and meaningless number. On the other hand the Campbell and Shiller
approximation shall give either meaningless negative or positive values depending on
which between δt and δt+1 corresponds to almost null dividends (and if they both are...).
For this reason the expansion should not be used over short time periods (small probability of a dividend payment) and it is better used for stock portfolios than for single stocks (at least some stock in the portfolio should yield dividends in each period if the time period is not too short). Obviously, always recall that log returns do not add across different stocks.
Let us now go back to Campbell and Shiller approximation. Notice that k and ρ
depend on the point around which we do expand the log return using Taylor. This
implies that, in what follows, if we intend to use the formula by, for instance, iterating it
or summing it over different times, we must assume that the point where the expansion
is run is constant and, in particular, does not depend on δt . Keep this in mind in order
to understand what follows.
Above we showed that log total returns are additive over time if we consider reinvestment of the dividends in the stock. We are now approximating log returns and could hope that, while an approximation, the Taylor expansion is additive, that is: the Taylor approximation of the many period log total return is identical to the sum of the Taylor approximations of the one period log total returns.
This is obviously true, as the log total return with reinvested dividends is equal to
the sum of the two one period log total returns and the two variable Taylor formula at
the first order does not require mixed derivatives. However we need some care with the
expansion point. Our one period return expansion is in terms of δt+1 . We now have
δt+1 and δt+2 . We have a function of two variables. In order to compare approximations
we need to expand the two variable function around a point with identical coordinates,
let us say δ. Let us begin with the sum of the two expansions for the two one period
returns.
ln((P_{t+2} + D_{t+1})/P_{t+1}) + ln((P_{t+1} + D_t)/P_t) ≈
≈ k − ρδ_{t+2} + δ_{t+1} + d_{t+1} − d_t + k − ρδ_{t+1} + δ_t + d_t − d_{t−1} =
= 2k − ρ(δ_{t+2} + δ_{t+1}) + δ_{t+1} + δ_t + d_{t+1} − d_{t−1}

where 2k = ln((1 + e^δ)²) − (e^δ/(1 + e^δ))2δ and ρ = 1/(1 + e^δ). This is exactly what we would get by writing

R*_{t;t+2} = ln((1 + e^{δ_{t+2}})(1 + e^{δ_{t+1}})) + p_{t+2} − p_t

and expanding the log part to the first order around the point of coordinates [δ_{t+2} = δ; δ_{t+1} = δ].
In the end: if we use the convention of reinvesting dividends in the stock and
suppose to do this on a period by period basis, log total returns are additive over time
and Taylor expansion keeps this linearity property.
This is quite expedient from the point of view of simplicity. Remember, however, that, as already mentioned, if we define returns in this way and then proceed in building models for returns, we must fully understand that returns so defined are not the original ratios of quantities. They are no more percentages and, overall, may have very different properties.
For instance: percentage returns have a lower bound (−1, that is −100%) while log returns have no lower bound, so that the Gaussian distribution cannot be a satisfactory model for percentage returns while it can be a good model for log returns.
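A quick sketch of this last point (the volatility used is deliberately exaggerated to make the bound visible):

```python
import math
import random

# Sketch: draw Gaussian log returns (unbounded below) and note that the implied
# percentage returns exp(r) - 1 never fall below -1, while a Gaussian model for
# percentage returns would put positive probability below -100%.
random.seed(0)
log_returns = [random.gauss(0.0, 0.5) for _ in range(100_000)]  # huge volatility on purpose
pct_returns = [math.exp(r) - 1.0 for r in log_returns]

print(min(log_returns) < -1.0)    # True: log returns can go below -1
print(min(pct_returns) > -1.0)    # True: percentage returns cannot
```
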
20.11 *Remainder term
We are now in the position of computing the remainder term in the Lagrange form, which shall be

(1/2!) (∂²/∂δ_{t+1}² ln(1 + e^{δ_{t+1}}))|_{δ_{t+1}=δ_L} (δ_{t+1} − δ)² = (e^{δ_L}/(1 + e^{δ_L})²) (δ_{t+1} − δ)²/2 ≤ (1/8)(δ_{t+1} − δ)²

(The maximum of x/(1 + x)² is attained at x = 1 and is equal to 1/4).
With an error of this size we are able to "take out linearly" δ_{t+1} from the logarithm. Notice that no other approximation is made in the formula for R*_{t+1}. The rest of the result only depends on adding and subtracting d_t and d_{t−1} and reordering terms.
Again: the error is always positive, so the approximation always underestimates the true total return. The size of the error depends in a quadratic way on the distance of the log dividend price ratio from the constant δ. If the log dividend price ratio is a stationary process with small variance and we choose δ = E(δ_t) (obviously, in view of stationarity, this does not depend on t), the approximation is going to work very well. In fact in this case the expected value of the maximum error shall be:

E((1/8)(δ_{t+1} − δ)²) = E((1/8)(δ_{t+1} − E(δ_{t+1}))²) = (1/8) Var(δ_{t+1})
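A numerical check of the bound; the expansion point is a hypothetical value, the draws of δ_{t+1} are arbitrary:

```python
import math
import random

# Sketch: the first order Taylor error of ln(1 + e^x) around a hypothetical
# point delta is always positive (convexity) and bounded by (x - delta)^2 / 8.
random.seed(1)
delta = math.log(0.04)
slope = math.exp(delta) / (1.0 + math.exp(delta))   # first derivative at delta

for _ in range(10_000):
    x = delta + random.uniform(-2.0, 2.0)           # a draw of delta_{t+1}
    exact = math.log(1.0 + math.exp(x))
    linear = math.log(1.0 + math.exp(delta)) + slope * (x - delta)
    err = exact - linear
    # allow a tiny tolerance for floating point rounding
    assert -1e-12 <= err <= (x - delta) ** 2 / 8.0 + 1e-12
print("error always positive and below the (1/8) bound")
```
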
However beware: this is true for each single step of the approximation; if we use the approximation for computing many terms and then sum them, we are going to run into problems. In fact, since the error always has the same sign for each single term, any summation is going to increase its overall size. Hence: be careful if you see (as shall be the case) this approximation used in any summation or series. It may work but care is needed.
20.12 *Dividend price model
A common use of the approximation is as follows.
Suppose the approximation is actually exact, so that

R*_{t+1} = k − ρδ_{t+1} + δ_t + Δd_t
(with Δd_t = d_t − d_{t−1}). If we forget everything about the approximations (errors and constants which are really not constants), this may be taken as a linear difference equation for δ_t:

δ_t = ρδ_{t+1} − k − Δd_t + R*_{t+1}
which can be solved iteratively forward in time (beware! recall the analysis we did a moment ago about repeated use of the approximate identity) as

δ_t = ρ^{m+1} δ_{t+m+1} + Σ_{j=0}^{m} ρ^j (R*_{t+j+1} − Δd_{t+j} − k)
If we suppose that the summation converges (in some sense) as m goes to infinity we have

δ_t = Σ_{j=0}^{∞} ρ^j (R*_{t+j+1} − Δd_{t+j}) − k/(1 − ρ)
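Before the limit is taken, the finite-m iteration is an exact algebraic rearrangement, whatever the series involved. A sketch verifying this on arbitrary made-up series (k, ρ and the random draws are hypothetical):

```python
import random

# Sketch: if R*_{t+j+1} is defined exactly by R* = k - rho*delta_{t+1} + delta_t + dd_t,
# then delta_t = rho^(m+1) delta_{t+m+1} + sum_{j=0}^m rho^j (R*_{t+j+1} - dd_{t+j} - k)
# holds exactly, for any series whatsoever.
random.seed(2)
k, rho, m = 0.16, 0.96, 50
delta = [random.gauss(-3.2, 0.3) for _ in range(m + 2)]   # delta_t, ..., delta_{t+m+1}
dd = [random.gauss(0.05, 0.1) for _ in range(m + 1)]      # dd_t, ..., dd_{t+m}
R = [k - rho * delta[j + 1] + delta[j] + dd[j] for j in range(m + 1)]  # R*_{t+j+1}

rhs = rho ** (m + 1) * delta[m + 1] + sum(
    rho ** j * (R[j] - dd[j] - k) for j in range(m + 1))
print(abs(delta[0] - rhs))   # essentially 0: an exact identity
```
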
Notice that, given the properties of the approximation error, such a convergence would require, at the very least, the approximation error to go to 0 (we wrote the summation supposing it equal to 0). But this would be possible only if the log dividend price ratio became a constant equal to δ.
Now, put this worry in the back of your mind and take the expected value conditional on t of the two sides of this equation. We get
δ_t = Σ_{j=0}^{∞} ρ^j E_t(R*_{t+j+1} − Δd_{t+j}) − k/(1 − ρ)
As before, this formula, apart from the approximation, is still an identity but, if we
use any model according to which the current log price dividend ratio is, apart from a
constant, given by the sum of the discounted expected values (corrected for risk) of the
future log dividend changes, we must accept that the proper discount rate is h = 1/ρ − 1 (so that the discount factor 1/(1 + h) is equal to ρ) and the risk premium is given by E_t(R*_{t+j+1}). In other words, apart from the approximation error, this "identity" imposes
constraints on the modeling of the price dividend ratio, dividend growth and returns as, roughly, modeling any two of these implies modeling the third. A very imprecise but evocative way to read this statement is that you should, for instance, be able to forecast, at least partially, future returns if you have a model of dividend growth and you observe the price dividend ratio.
Moreover, if we assume E_t(R*_{t+j+1} − Δd_{t+j}) = q, some constant, we are back to the
Gordon model, now in log return form (at least, this is what the Authors say, but some
care should be taken dealing with ρ and k).
Something more on the approximation. We should keep in mind that the linearity of the "difference" equation fully comes from the approximation and strictly requires that ρ be a constant. This is true only if we suppose the expansion point is always the same. So, when we iterate the equation we are going to incur different approximation errors for each iterate, depending on how far the actual δ_{t+1} is from the expansion point δ.
20.13 *What happens if we take the remainder seriously
As an exercise in critical analysis of a Taylor formula application, let us think a little
bit about the consequences of the approximation.
Two points must be kept in mind. The first is that a Taylor approximation is a
local approximation. That is: it is only based on the value of a function and of its
derivatives at a point. The second is that the Taylor approximation is a polynomial
approximation: whatever the function to approximate, the approximation is made with a polynomial. This implies that, the more the function and its derivatives change in a given interval, the less well the approximation shall work (think, for instance, about the approximation of sin(1/x) for x > 0 but not so far from it). Moreover some properties of polynomials can be very different from the properties of the underlying function. For instance: sin(x) is bounded for any value of the argument (it oscillates between −1 and +1), while any non constant polynomial is always unbounded; sin(x) is a periodic function but no (non constant) polynomial may ever be a periodic function.
Other simple but interesting facts can be derived by thinking about Taylor approximations of functions that actually ARE polynomials. While the approximation shall be perfect if its order is greater than or equal to the degree of the approximated polynomial, it could be quite bad if truncated at a lower order. For example: let f(x) = ax + bx². The first order Taylor approximation in x = 0 is ax, so that the error is the full bx², which could be huge even near x = 0 if a is small compared to b.
In our case, under some stationarity hypothesis on δ_{t+1}, nothing really bad should happen if we use the approximation for a single point in time and when δ_{t+1} is near the approximation point (typically given by δ = E(δ_{t+1})). Even in this case, however, if the log dividend price ratio is not equal to δ we shall have a positive error, that is: the approximation shall always undervalue R*_{t+1}. This is not a problem if we do not sum many of these errors.
If we take into account the remainder/error term and call it et+1 we get
R*_{t+1} = k − ρδ_{t+1} + δ_t + Δd_t + e_{t+1}
From this we have
δ_t = ρδ_{t+1} − k − Δd_t + R*_{t+1} − e_{t+1}
The same iteration as above yields
δ_t = ρ^{m+1} δ_{t+m+1} + Σ_{j=0}^{m} ρ^j (R*_{t+j+1} − Δd_{t+j} − k − e_{t+j+1})
Hence, if we suppose, again, that the summation converges (in some sense) as m
goes to infinity we have
δ_t = Σ_{j=0}^{∞} ρ^j (R*_{t+j+1} − Δd_{t+j} − e_{t+j+1}) − k/(1 − ρ)
Taking the expected value conditional on t we get now
δ_t = Σ_{j=0}^{∞} ρ^j E_t(R*_{t+j+1} − Δd_{t+j} − e_{t+j+1}) − k/(1 − ρ)
We see that, if we take into account the error term, the expected value is now always smaller than what we could get from the approximate formula. In the discounted expected dividend changes interpretation given above, this amounts to saying that the error acts by decreasing the risk premium from R*_{t+j+1} to R*_{t+j+1} − e_{t+j+1}, and the sum of the error terms appearing in the right hand side shall be Σ_{j=0}^{∞} ρ^j e_{t+j+1}. Under the stationarity hypothesis we made, the expected value of this can be bounded as:

Σ_{j=0}^{∞} ρ^j E_t(e_{t+j+1}) ≤ Σ_{j=0}^{∞} ρ^j Var(δ_{t+1})/8 = Var(δ_{t+1})/(8(1 − ρ))
If the expected dividends are small with respect to the price, that is: if ρ is near 1 (remember: ρ = 1/(1 + e^δ)), the overall error could be quite big. This would imply that the above interpretation of the current log dividend price ratio, as the expected value of future discounted differences between expected log returns and expected dividend log returns, would severely overestimate the current dividend price ratio. It is also useful to remember that, since the observable δ_t is the sum of the approximation and the error, it may well be that the stochastic properties of δ_t are quite different from the properties of the approximation.
In fact this analysis is still incomplete as the error term is only formally introduced
in a difference equation which is still believed to be linear. What we did above is to
study the implication of substituting the error for each observation in a linear difference
equation which we held as correct. This is not the case. The error itself contains terms which depend on δ_{t+1}. If we make this explicit we have:

δ_t = ρδ_{t+1} − k − Δd_t + R*_{t+1} − (e^{δ_L}/(1 + e^{δ_L})²)(δ_{t+1} − δ)²/2 = η₁δ_{t+1} + η₂δ²_{t+1} + η_{0,t}
2
While it should be noticed that δL itself depends on δt+1 . In any case, even supposing δL constant we have a quadratic difference equation of which we should study
the dynamic properties. This could be quite a problem as this quadratic equation,
depending on the values of the parameters, could be dangerously similar to those of
the logistic equation and so completely different from those of a simple linear equation
(the approximation). To understand this try iterating the formula.
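A sketch of the contrast, with purely hypothetical coefficients (these are not calibrated to the model above): a linear contraction settles to its fixed point, while a logistic-type quadratic map can wander forever with sensitive dependence on the starting value.

```python
# Sketch: iterate a linear map and a quadratic (logistic-like) map with
# hypothetical coefficients; only the linear one converges to a fixed point.
def iterate(f, x0, n):
    x = x0
    for _ in range(n):
        x = f(x)
    return x

linear = lambda x: 0.5 * x + 0.2            # contraction: fixed point at 0.2/(1 - 0.5) = 0.4
logistic = lambda x: 3.9 * x * (1.0 - x)    # classic chaotic regime of the logistic map

print(iterate(linear, 0.1, 200))            # essentially 0.4
a = iterate(logistic, 0.1, 200)
b = iterate(logistic, 0.1000001, 200)       # a tiny change in the start...
print(abs(a - b))                           # ...typically ends far away
```
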
As far as I know, in the (huge) literature using the Campbell and Shiller approximation, the quality of the approximation is usually studied in a pointwise way (observation by observation), where it works at least when the dividend price ratio has small variance (and now we know why), while the difference equation is applied over long (possibly infinite, as above) time horizons. A partial exception can be found in "The Log-Linear Return Approximation, Bubbles, and Predictability": Tom Engsted, Thomas Q. Pedersen and Carsten Tanggaard, JFQA (2012).
20.14 *Cochrane toy model.
20.14.1 *Forecasting stock returns: to make a long story short
The problem of the forecastability of stock returns has (obviously) a long history inside and outside the academic Finance literature.
Ever since markets have existed and prices of goods have been seen to change in time, humans have been interested in two related jobs: reducing the risk of losses due to price fluctuations (risk management), and forecasting such changes in order to invest in the good whose price would go up before this happened (speculative trading).
About risk management. The Ecclesiastes 11.1-3 says “Cast your bread on the
surface of the waters, for you will find it after many days. Divide your portion to
seven, or even to eight, for you do not know what misfortune may occur on the earth.
If the clouds are full, they pour out rain upon the earth; and whether a tree falls toward
the south or toward the north, wherever the tree falls, there it lies. He who watches the
wind will not sow and he who looks at the clouds will not reap. Just as you do not know
the path of the wind and how bones {are formed} in the womb of the pregnant woman,
so you do not know the activity of God who makes all things". Which, according to experts, means in modern terms: "trade your goods because you'll get reward from trading. Do not invest everything you have in just one project because you do not know which project shall be successful. Some events are random and unforecastable; notwithstanding this, you must take decisions and act".
About speculative trading. Thales of Miletus (6th century B.C.) is said to have bought the rights to use oil presses before a harvest and resold these to olive producers at a huge profit.
In the Ecclesiastes quotation, an interesting point is the connection set between the
will of God and randomness.
This is a very common finding in antiquity. Very roughly speaking, a notion common to many different and distant cultures was (and somewhat still is) this: the will of the gods is made manifest in what we cannot forecast. For this reason any event we would term "random" had a "divine" content and, in fact, augurs (not to be confused with oracles) derived their wisdom from a detailed observation of these random events. In this sense gambling had a sacred content, as "luck" was nothing but a manifest sign of the favor of the gods. For this reason versions of objects we connect with gambling are commonly found on sacred grounds (astragali, throwing sticks, dice, fortune bones etc.).
There is a corollary to this which seems to have been (and for some still is) quite widespread: since God manifests himself in gambling, it would be sacrilege to try and "forecast" God's manifestation in gamble. This is one of the reasons suggested to explain the mystery of the fact that mathematically advanced peoples, the likes of the Egyptians, Sumerians, Greeks and Chinese, while spending much of their time and fortunes in gambling, did not develop even an elementary form of Probability Calculus, while possessing all the basic mathematical tools to do so.
This "divine" attitude toward randomness had its critics. For instance Cicero in "De divinatione", a dialogue we usually do not study in high school, where his consular exploits are preferred, discusses with remarkable clarity very modern ideas about "randomness" and "chance". Among many interesting points: a definition of a random event as an event whose happening or not cannot be determined by what is known before the event, and the qualitative if not quantitative understanding that it is easier to get by chance a given result with two dice than to get by chance a beautiful painting by randomly throwing colors on a canvas (amazingly modern in this!). He also states a very important principle according to which, since something is forecastable only if it depends on what happens before the event, and the study of the connection between "causes" and "effects" requires time and thought, "forecasting", when this is possible, should be left to experts, while haruspices dabble in trying to forecast what actually is random.
A big innovation was made by the Christian and Islamic religions which, both in view of debasing competing creeds and as a one way recovery of part of the biblical teaching, practically forbade gambling, fortune telling and lending at an interest (and here the point is that these three things were considered connected). In some extreme versions any monetary saving, or even any saving that was not that of seeds for the next year, was considered blasphemous. Francis of Assisi's idea is even more extreme: we are part of nature and if we behave like the rest of nature we shall be provided for by God in the same way he provides for birds and flowers. So any saving is in some sense a
lack of faith in divine providence.
Something was saved of antiquity: God's will would not and could not be forecast. The only way to be preserved from its wrath was prayer, not forecasting or risk management. However God shall be fruitful for those who follow his will. So, similarly to ancient beliefs on gambling, if things go well for you this is a sign that God is with you.
This is an interesting point which begins to be more and more relevant with the
end of middle ages. In a full agreement with the Ecclesiastes, and somewhat recovering
ancient ideas about gambling, merchants did really believe that to take a risk, not by
gambling but by “venture” was a way to put one’s virtues in front of God and success,
or defeat, was a proof of one’s qualities not only as a merchant but as a good Christian
(or Muslim).
Today it is difficult to understand the difference perceived by our forebears between making money by trading goods in European fairs, speculating about the future price of a good (fine), buying it in excess if you believe in an increase of its price (suspect), or lending at interest to people willing to deal in these two activities (forbidden). However many contemporary widespread attitudes w.r.t. banks, financial markets, mortgages etc. (in particular those common ideas which resurface during times of crisis) can be understood only in the light of these ancient stereotypes.
Fast forward to US financial markets at the end of the nineteenth century. In the "stocks for the long run" section we discussed some properties of a famous dataset by Shiller (http://www.econ.yale.edu/~shiller/data.htm). This dataset contains yearly and monthly data about stock and bond indexes, in real and nominal terms, from 1871 to the present day: almost 150 years. Moreover data is available on dividends and earnings.
By itself, a detailed study of this dataset's properties, how it was compiled, which statistical problems are implied by the construction procedure, and so on, would require the time of this course and more. We are completely avoiding these important points and shall simply use this dataset in order to discuss some points related with stock return forecasting. Since our purpose is just to introduce Cochrane's paper in the next section, we shall be very quick in doing so.
Some academic interest in financial markets has existed for a long time and with very advanced results. However it is only during the fifties of the past century that what is today defined as the study of Finance (at least in an academic setting) moved its first steps.
While most practitioners were (and are) interested in stating the "prospects" of possible investments, the first studies in Finance were dedicated to another objective. In the market we observe stocks whose returns show completely different properties in terms of distribution of future returns (at least as these properties can be estimated using past data). So the initial question was different, and more interesting: "why all these very different prospects of returns are traded and find a price every day in the
market". "Why is it not the case that only the "best" survive and the rest disappear?"
In other words the question, phrased in simple terms connected with just one statistical property, is "How do we explain the cross section of expected stock returns".
Note: the question takes for granted that some estimate of expected returns (and more) is possible. It does not question such an estimate and concentrates on how it is possible that stocks with different (but known) expected returns are all traded at the same time.
This is not the place to deal with this question and with the advancements which start with Sharpe and evolve during 30 years to Fama and French and beyond. As the reader knows, the basic idea to solve this puzzle is that different expected returns are reasonable in equilibrium if they are the result of different amounts of non diversifiable risk paid at a rate that only depends on the specific risk.
An increasing number of empirical studies were directed at proving or confuting such theories and, at least beginning with the eighties, these studies began to have a noticeable influence on actual asset management procedures.
Up to at least the first half of the seventies most academic students of Finance
shared the idea that, at least for stock prices, a simple log random walk model was
all that was needed. Once you knew expected value, variance and covariances of stock
returns, this was it. Most of the studies concentrated on expected values, then, during
the eighties covariances and variances were also studied in detail and the Gaussian
hypothesis was partially removed.
However: no forecasting, just random walk. While some studies indicated the possibility of return predictability, this was considered more a puzzle than a feature of returns. Something to explain away, not something on which to build.
In part, at least up to some point in the middle of the seventies, this came from the idea that there existed some theorem of Probability which told you that "properly forecasted prices follow a random walk", so that forecastability was seen as an instance of the abhorred "free lunch" which implied irrationality in Economics.
This is not correct but this is not the place to discuss such ideas.
It is also to be admitted that the quality of financial data available to academic researchers during this time was quite poor. Moreover, such data mostly referred to the period between 1950 and 1980 which, in the light of what happened afterward, could be considered a very strange and uninteresting period.
Things changed during the second half of the eighties. See for a very relevant example: Fama, E. F., and K. R. French. 1988. "Dividend Yields and Expected Stock Returns." Journal of Financial Economics 22:3-25. More and more high quality and detailed data became available. More powerful statistical tools were conceived, and low cost computers on which to run them became a commodity.
What is more important, markets changed: the cold war ended, financial crises (even in the USA) were back, sovereign debts became gigantic and happily traded. Students of stock prices began observing "anomalies" and ended up finding "properties".
These anomalies/properties were, at first, included in standard cross sectional models by devising more and more risk factors which determined expected returns (see for example the series of Fama and French papers starting in the late eighties). More and more, these anomalies began to take the form of expected returns or other moments of the return distribution whose value changed conditional on the available information.
In the meantime, the almost religious abhorrence of forecastability (see above for historical precedents) was at least partially tamed in two very different ways.
The first is the understanding that conditional expected values (variances, covariances) changing in time did not mean “free lunches”.
The second is the more and more widespread idea that economy and finance are, after all, driven by true human behavior and that any simple description of such behavior by some naive "rational" model of identical agents was perhaps too narrow a paradigm. Behavioral Finance was the result of such ideas.
The expository paper by Cochrane we are going to discuss is to be intended as a
review and an attempt to systematize a part of the efforts toward studying stock return
predictability as an actual possibility.
A quick word about data. In most papers, long run analyses of returns are done using some version of a total return, inflation corrected, U.S. stock index.
The most frequently used datasets are:
• The already mentioned Shiller dataset from "Irrational Exuberance". You can find both a yearly and a monthly version of it here: http://www.econ.yale.edu/~shiller/data.htm
• The dataset used by Cochrane in his "Dog that did not Bark" paper; the dataset and the MATLAB programs for the computations in the paper are in: http://faculty.chicagobooth.edu/john.cochrane/research/Data_and_Programs/The_Dog_That_Didnt_Bark/
• A dataset similar but not identical to Shiller's, containing monthly, quarterly and yearly data: http://www.hec.unil.ch/agoyal/docs/PredictorData2012.xls
We should go into the details of the datasets but we won't, here. These are all backward reconstructed index data. For this reason they are fraught with statistical problems connected with, but not limited to, the evolution of the index compositions.
We do not consider these important points here. However, we suggest the student to compare the datasets and try to find reasons for the non negligible differences. In particular, be very careful when comparing data on different time periods. For instance, in the monthly version of Shiller's data, dividends per month are not actually observed but interpolated from yearly dividends. For this reason monthly total returns shall be less variable than they should be (we shall briefly comment on this in what follows).
It is extremely important that anyone interested in the history of USA stock market
behavior be acquainted with the relevant historical data, and with the problems
intimately connected with the collection of historical data over long time spans; this
is a good place to start.
20.14.2 *A quick look at the data and some sketch of hypotheses
In order to show how this approximation is currently used in the analysis of stock
returns, with all its pros and cons, we present here a critical analysis of a famous (and
not too hard, from the technical point of view) paper by J. H. Cochrane.
In Cochrane, J. H. “The Dog That Did Not Bark: A Defense of Return Predictability.”
Review of Financial Studies, 21 (2008), the Author discusses some interesting
points about the forecastability of stock total returns, in particular on the basis of the
dividend price ratio.
Cochrane tries to put some structure on the problem: a simple model for the
dynamics of log returns (total returns), log dividend price ratio and log dividend
growth is introduced.
(To simplify notation from now on, and to be consistent with Cochrane’s notation,
in this section we indicate the total log return with $r$ instead of $R^*$).
$$r_{t+1} = a_r + b_r \delta_t + \epsilon^r_{t+1}$$
$$\Delta d_{t+1} = a_d + b_d \delta_t + \epsilon^d_{t+1}$$
$$\delta_{t+1} = a_\delta + \phi \delta_t + \epsilon^\delta_{t+1}$$
This can be understood as a simple rule-of-thumb model for forecasting returns and
dividends and, at first, it is used in an informal, exploratory way in order to assess our
ability to forecast stock total returns (rt+1). A direct naive estimate of this model
based on yearly data (1926-2004) gives the following results (see note 72):
yearly data   estimate    est-stdev   P-value   R2
b_r           0.092039    0.051225    0.0763    0.04
b_d           0.007486    0.029968    0.8034    0.0008
φ             0.945744    0.043919    0.0000    0.8575
72 This kind of analysis uses data in real terms; we do not comment here on this use, which could
be a good first approximation. The data we use are from Shiller’s “Irrational Exuberance” and the
results are similar but not identical to what you get using data by Cochrane. We use Shiller’s data
because a monthly version is available; recall the warning we just made about how monthly dividends
are computed. Roughly speaking, Shiller reconstructs the S&P composite index while Cochrane uses
the CRSP index. Results with other datasets are qualitatively similar but quantitatively different. In
particular, the estimated standard deviation for b_r may be quite different across different datasets. As
usual, a long term effect may only be estimated with long range data and, after all, modern Finance
is a quite young phenomenon.
Before commenting on the results we repeat the same analysis with monthly data from
the same source and time period:
monthly data  estimate    est-stdev   P-value   R2
b_r           0.004088    0.003446    0.2358    0.001486
b_d          -0.003013    0.000890    0.0007    0.011968
φ             0.995886    0.003545    0.0000    0.9881
It is to be noticed that the monthly dividends in Shiller’s data are computed by
interpolation. Since what interests us are the main characteristics of the results, this
is not a big problem. However, if we wish to go into details, this should be a point to
discuss.
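A minimal sketch of these exploratory regressions. We run them here on simulated data in the spirit of the three-equation model; the arrays r, dd and delta (log total returns, log dividend growth, log dividend price ratio), the helper ols and all parameter values are ours, purely for illustration, not Cochrane's:

```python
import numpy as np

def ols(y, x):
    """Univariate OLS of y on a constant and x: returns (intercept, slope, R^2)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta[0], beta[1], 1 - resid.var() / y.var()

# Simulated data with hypothetical parameter values (roughly in the ballpark
# of the yearly estimates above, but NOT the actual estimates).
rng = np.random.default_rng(0)
T = 5000
a_r, b_r = 0.10, 0.09
a_d, b_d = 0.02, 0.0
a_delta, phi = -0.14, 0.94
delta = np.empty(T + 1)
delta[0] = a_delta / (1 - phi)        # start at the unconditional mean
for t in range(T):
    delta[t + 1] = a_delta + phi * delta[t] + 0.05 * rng.standard_normal()
r  = a_r + b_r * delta[:-1] + 0.18 * rng.standard_normal(T)   # returns equation
dd = a_d + b_d * delta[:-1] + 0.12 * rng.standard_normal(T)   # dividend growth equation

# The three exploratory regressions, each on the lagged log dividend price ratio.
for name, y in [("b_r", r), ("b_d", dd), ("phi", delta[1:])]:
    a_hat, b_hat, r2 = ols(y, delta[:-1])
    print(name, round(b_hat, 3), "R2 =", round(r2, 4))
```

With a long simulated sample the slopes recover the parameters used in the simulation; on real data the same three regressions produce the tables above.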
A first simple reading of these exploratory results is as follows:
1. We can forecast a (small) part of the variance of future returns even using only
the dividend price ratio. Forecastability seems to increase passing from monthly
to yearly data.
2. The rate of change in dividends seems independent of the dividend price level.
3. The price dividend ratio is very persistent, that is: future price dividend ratios
are easy to forecast on the basis of past price dividend ratios.
Forecastability, even at the yearly level, seems an almost negligible phenomenon but: do
not be fooled by the 4% R2. An interesting possible consequence of 1+3 is that it
should not be much more difficult to forecast single period returns far in the
future than a single period return just one period ahead on the basis of price dividend
ratios, since price dividend ratios are very persistent. In fact, high persistence of log dividend price
ratios means these are easy to forecast for a rather long horizon. So we can apply our
4% R2 not just to the next return but to a rather long stretch of future returns.
Now the magic: the fact that a part of the variance of returns can be forecasted
using a very persistent series opens another interesting possibility.
Since this happens, this implies that a part of returns variance is due to a persistent
(very much autocorrelated) component.
However, if we consider returns over, say, a year, returns themselves are not persistent:
a regression of current returns on lagged returns gives an estimated slope of 0.076306
with a standard error of 0.113613 and a p-value of 0.5038, corresponding to an R-square
of 0.005. It is then possible to write each (say) monthly return as the sum of a (small
variance) persistent (autocorrelated) component, due to the dependence on the price
dividend ratio, and a (big variance) non autocorrelated component.
Let us write this as $r_t = a_t + w_t$ where we suppose the correlation between the two
components to be 0.
We know that log returns over many time periods (say n) are the sum of log returns
over subperiods. This is true for simple log returns and also for log total returns with
the reinvestment convention we used above.
In this case $r_{nt} = \sum_{t=1}^n r_t = \sum_{t=1}^n a_t + \sum_{t=1}^n w_t$.
In this case the variance of a return over n time periods shall be given by (supposing
perfect autocorrelation between the persistent components and zero autocorrelation
between the non persistent components)
$$V(r_{nt}) = V\Big(\sum_{t=1}^n a_t\Big) + V\Big(\sum_{t=1}^n w_t\Big) = n^2 V(a) + n V(w)$$
Pn
Under
these
hypotheses,
in
fact,
(assuming the expected value
t=1 at = a1 n
Pn
Pand
n
of t=1 wt to be equal to 0) a regression of rnt on t=1 at = a1 n shall give a beta of 1
(a)
which tends to 1 if n increases.
and an R2 = nV nV
(a)+v(w)
For instance, deriving rough numbers from our data, if $V(a) = 1.5$ and $V(w) =
998.5$, while for 1 time period only 0.15% of the variance of returns is due to the
forecastable $a$ component, for, say, 12 time periods the ratio becomes $216/(216 +
11982) = 1.77\%$. If we enlarge the time interval even more, say to 120, we get
$21600/(21600 + 119820) = 15.27\%$. The longer the horizon the better the forecast,
because the variance of the forecastable part grows with the square of n while the rest
grows with n. Actually, things could be even better.
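The back-of-the-envelope computation above can be checked in a couple of lines (the variances 1.5 and 998.5 are the rough numbers used in the text):

```python
# R^2 of the n-period return on its persistent component: the variance of the
# persistent part grows as n^2, that of the noise part as n.
def r2_horizon(n, var_a=1.5, var_w=998.5):
    return n**2 * var_a / (n**2 * var_a + n * var_w)

for n in (1, 12, 120):
    print(n, round(r2_horizon(n), 4))
# 1 -> 0.0015, 12 -> 0.0177, 120 -> 0.1527
```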
If this is true, by forecasting price dividend ratios and using these to forecast one
period returns, then summing these up, we should be able to forecast a sizable part of
log total returns over longer time periods. The longer the length of the time period,
the bigger the forecastable part.
It is important to understand the exact meaning of the sentence “by forecasting
price dividend ratios”. If we are at time t and wish to forecast the return over m time
periods we have two possible solutions. The first is to do what Cochrane does and what
we did a moment ago, that is: estimate a model that connects $r_{t+1}$ to $\delta_t$ and a model
which connects $\delta_{t+1}$ with $\delta_t$. Then, since $r_{t+m}$ is, with the proper assumption of dividend
reinvesting, the sum of m one period total returns, and each of these one period returns
depends on the corresponding price dividend ratio, use the price dividend ratio model
to forecast the m intermediate price dividend ratios (in the Cochrane simple model each
of these shall be a function of the same $\delta_t$), then use each of these forecasts to forecast
the corresponding one period return, and get the estimate of the m period return by
summing these forecasts. The second possibility is to take return data at periodicity
m, regress these on the m corresponding one period price dividend ratios and then, at time
t, forecast these price dividend ratios and use the forecasts in the model for $r_{t+m}$.
It is clear that the second possibility is less useful than the first as, the greater
m, the smaller the available sample and the worse the estimates. However, the first
possibility heavily depends both on the approximation of m period returns based on
sums of 1 period returns and on the single period models for $r_{t+1}$ and $\delta_{t+1}$.
Let us now go back to our simple set of assumptions. Obviously this is a very quick
sketch. For instance, while in our data we see that it is actually easier to forecast
yearly than monthly returns using dividend price ratios, the R2 goes from about 0.15%
to 4%, not to 1.77%. This means that, at least in passing from monthly to yearly returns,
forecastability improves more than expected from our very simple sketch.
Surely we are forgetting something important, probably many important things.
A possible explanation is that data on dividends are quite sparse during the year and
this could degrade forecasts within the year in a way which is not there
between years. Moreover, the wt could actually be not uncorrelated but slightly negatively
correlated. If this is true, the variance of their sum shall increase less than we expect.
Moreover, variances could depend on time, and so on.
We can further check our understanding of the data by passing from yearly to, say, 10
year returns. In this case we should pass from a 4% R2 to something in the range of
29% (400/(400+960) = 0.294). The obvious problem is that we only have 9 data points
in this case (we go up to 2006). This lack of data, by itself, could generate a positive
bias in the estimate of the R2. However, what we get if we run the regression on decadal
data is an R2 of 0.38 and, if we consider the estimate of R2 corrected for the degrees of
freedom, we get 0.297, which is even too good to be true.
Another possible problem is as follows. It has long been known that the variance
of returns grows less than linearly. That is: the variance of returns over, say, 10 time
periods is less than 10 times the variance of returns over 1 time period. The above
sketched argument would imply, instead, that the variance of returns should increase
faster than the number of periods over which returns are computed.
In our dataset the yearly variance of returns is 0.03584, so we should observe a 10
year variance above 0.3584 (equal to this if we had no autocorrelated component). The
actual 10 year return variance is only 0.1274. This may be due to the small sample size.
However, the monthly return variance is 0.002094801, so we should expect something
like 12 times this for one year, that is 0.02514, while we get 0.03583449, which is higher.
So, the picture is not as clear cut as we would like it to be. It may well be that
w contains some component with a sizable negative autocorrelation over the long run
and/or a is not simply perfectly autocorrelated (we know it is not) but has a more
complex autocorrelation structure, with a strong positive first order component but a
negative second order component. It may also be that the interpolated nature of monthly
dividends somewhat alters the picture.
Be that as it may, what is clear is that the study of return forecastability, simply
ruled out by orthodox Finance journals up to the first half of the nineties, is back
with a vengeance and much interest.
Orthodox literature is now full of papers which try to explain this forecastability,
or to find variables other than the dividend price ratio able to explain it. Consumption
is a strong candidate, in view of standard consumption and investment models and
considering its huge autocorrelation, but others enter the fray each month. One
of your teachers favors variables connected with life expectancy, while another believes
that some long run component in consumption does the trick.
The writer of these handouts firmly rests among those who believe it very difficult
to directly test for long run forecastability in financial data, simply because we do
not have long run data. Finance as we know it is a rather recent phenomenon and we
can at most study long run implications of short run models, as Cochrane tries to do,
but cannot test forecastability hypotheses directly on non existent long run data.
Be careful: we are dealing with long run forecastability. The philosopher’s stone
of a (legally admissible) machine for successfully forecasting tomorrow’s stock prices on
the basis of public information was and remains a chimaera.
Moreover the Reader shall not forget that we are trying to study long run properties
as consequences of short run models. Finance as we intend it is still a relatively recent
phenomenon dating back no more than a couple of centuries so there exists very little
direct information on long run properties of returns. The problem is that the study of
long run properties of short run models is a study of consequences of short run estimates
not a direct estimate of effects. Moreover, the evaluation of such long run implications
of short run models, and this shall be of paramount relevance for us, must typically be
based on the iteration of a short run model (typically a stochastic difference equation)
over many time periods. If the short run model is based on some approximation which,
on each single period, has maybe negligible error, but whose errors sum up to sizable
totals over many periods (this should ring a bell!) problems may and do arise.
In what follows we shall try to clarify how much the linearization of log total returns
interacts with Cochrane’s model and can help or hinder the study of the long run return
forecastability problem.
20.14.3 *A more detailed look at Cochrane’s model: one step ahead forecasts
In fact, this apparently simple model looks quite strange if you think about it a
little.
The log total return rt+1 (recall the change in notation from R∗ to r) is a function of
dividends and prices, the log dividend price ratio δt+1 is a function of dividends and
prices, and the log rate of increase of dividends is a function of dividends.
It is reasonable to model somewhat independently two of these quantities, not all
three of them. Three equations for only two underlying variables is a little too much,
and we can see this, without any approximation, in what follows.
Let us start from the definition of log total return
$$r_{t+1} = \ln(1 + e^{\delta_{t+1}}) + \ln\frac{P_{t+1}}{P_t}$$
hence
$$r_{t+1} - \Delta d_{t+1} = \ln(1 + e^{\delta_{t+1}}) - \delta_{t+1} + \delta_t$$
298
and if we use in this identity the equations of the model we have
$$a_r + b_r\delta_t + \epsilon^r_{t+1} - a_d - b_d\delta_t - \epsilon^d_{t+1} = \ln(1 + e^{a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}}) - a_\delta + (1-\phi)\delta_t - \epsilon^\delta_{t+1}$$
$$\epsilon^r_{t+1} - \epsilon^d_{t+1} - \ln(1 + e^{a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}}) + \epsilon^\delta_{t+1} = a_d - a_r - a_\delta + \delta_t(1 - \phi + b_d - b_r)$$
This is an identity, so it must always be true. Actually, this is a very nasty stochastic
nonlinear identity where, at time t, the rhs only contains non stochastic terms while
the lhs contains functions of three stochastic variables (the epsilons). Hence, as we
could expect, since we have three random variables, functions of which must satisfy an
identity (the definition of log total return), there must be some nonlinear
but exact, that is: functional, dependence between the three errors in the model, which
can be true only under very peculiar hypotheses on the error terms (for instance, as we
can expect, two of them imply the third). This dependence is what is described in the
above formula: if we observe δt we can solve for the third epsilon given the other two (pick
your choice). It is clear that this identity creates problems in the “simple” model by
Cochrane, as it implies that the epsilons are dependent on the value of δt, ruling out the
interpretation of the Cochrane model as a regression model (there exists dependence
between the errors and the regressors).
Since we have a functional dependence among the epsilons (and δt) all their
stochastic properties shall be related. This implies that we cannot simply estimate
the above model with, say, a VAR estimate, without imposing the exact nonlinear
restriction implied by the definition of log total return.
A good way to express this identity in the Cochrane model would be to change the
first of the three equations in the following (very nonlinear) way
$$r_{t+1} = a_d - a_\delta + (1 - \phi + b_d)\delta_t + \ln(1 + e^{a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}}) + \epsilon^d_{t+1} - \epsilon^\delta_{t+1}$$
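The functional dependence among the three error terms can be verified numerically: pick arbitrary parameter values and two of the epsilons, build the exact log total return from its definition, back out the implied $\epsilon^r$, and check that the identity holds (all numbers below are arbitrary illustrations, not estimates):

```python
import math

# Arbitrary parameter values and shocks (illustrative only).
a_r, b_r = 0.10, 0.09
a_d, b_d = 0.02, 0.01
a_delta, phi = -0.15, 0.95
delta_t, eps_d, eps_delta = -3.4, 0.03, -0.02

# Evolve the state with the last two equations of the model.
delta_t1 = a_delta + phi * delta_t + eps_delta   # log dividend price ratio
dd_t1 = a_d + b_d * delta_t + eps_d              # log dividend growth
# Exact log total return from the identity r - Δd = ln(1+e^δ') - δ' + δ.
r_t1 = dd_t1 + math.log(1 + math.exp(delta_t1)) - delta_t1 + delta_t
# The first equation then *defines* eps_r: no freedom is left.
eps_r = r_t1 - a_r - b_r * delta_t

lhs = eps_r - eps_d - math.log(1 + math.exp(delta_t1)) + eps_delta
rhs = a_d - a_r - a_delta + delta_t * (1 - phi + b_d - b_r)
print(math.isclose(lhs, rhs))  # True: the three errors are functionally dependent
```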
Let us now pass to our approximation. Nothing changes except the fact that we
shall use an approximate definition of log total return:
$$r_{t+1} = k - \rho\delta_{t+1} + \delta_t + \Delta d_{t+1}$$
and we write this, coherently with what we just did, as
$$r_{t+1} - \Delta d_{t+1} = k - \rho\delta_{t+1} + \delta_t$$
to be compared with the exact
$$r_{t+1} - \Delta d_{t+1} = \ln(1 + e^{\delta_{t+1}}) - \delta_{t+1} + \delta_t$$
Now everything is linear (we are forgetting the approximation error) and we get
$$a_r + b_r\delta_t + \epsilon^r_{t+1} - a_d - b_d\delta_t - \epsilon^d_{t+1} = k - \rho a_\delta + (1 - \rho\phi)\delta_t - \rho\epsilon^\delta_{t+1}$$
Again we have a (now linear) exact functional dependence between the errors
$$\epsilon^r_{t+1} - \epsilon^d_{t+1} + \rho\epsilon^\delta_{t+1} = k - \rho a_\delta + (1 - \rho\phi)\delta_t + a_d + b_d\delta_t - a_r - b_r\delta_t$$
Proceeding as before we could substitute this identity in the first equation of the
model, getting
$$r_{t+1} = k + a_d - \rho a_\delta + (1 - \rho\phi + b_d)\delta_t + \epsilon^d_{t+1} - \rho\epsilon^\delta_{t+1}$$
This must be compared with the first equation in Cochrane’s model
$$r_{t+1} = a_r + b_r\delta_t + \epsilon^r_{t+1}$$
So we have two versions of the first equation in Cochrane’s model. The first is implied
by the other two equations and the assumption that the approximation be exact. The
second is the original version. If the approximation AND the model both work, we
should have that, whatever δt may be, the results of the two equations are the same. This
imposes constraints on the coefficients and the error terms of the two equations.
Let us now proceed as Cochrane does. Suppose that the expected values of the epsilons
are zero (conditional on δt). If we take the difference of the expected values of the two
versions of the same equation and require this difference to be 0 we have
$$0 = k - \rho a_\delta + (1 - \rho\phi)\delta_t + a_d + b_d\delta_t - a_r - b_r\delta_t$$
Since this must be true whatever δt may be, we need both the sum of the constant
terms and the sum of the slope terms to be equal to 0, that is
$$0 = k - \rho a_\delta + a_d - a_r$$
$$0 = 1 - \rho\phi + b_d - b_r$$
The second constraint is studied in the quoted paper by Cochrane as a necessary
condition for the approximation and the three equation model to be consistent.
A more correct (if less “nice”) reasoning is as follows (there is a hint of this in
Cochrane’s paper). Suppose you want both the three equation model and the
approximation to work whatever the value of δt, with epsilons independent of δt. In this
case you require
$$0 = 1 - \rho\phi + b_d - b_r$$
(for independence of δt), and
$$\epsilon^r_{t+1} - \epsilon^d_{t+1} + \rho\epsilon^\delta_{t+1} = k - \rho a_\delta + a_d - a_r$$
so that the errors satisfy the (approximate) linear constraint. We clearly see that
the simple test of H0 : 0 = 1 − ρφ + bd − br by itself is not sufficient as a test for the
joint hypothesis of validity of both the approximation and the three equation model
(with independence of errors and δt).
What Cochrane actually does is not checking whether the approximation works with the
model but whether the approximation works on average with the model.
A good test for this hypothesis would be given by the following procedure.
First, estimate the three equation model with the added constraint (we put back
the constants, which in Cochrane’s paper are for some reason forgotten)
$$\epsilon^r_{t+1} - \epsilon^d_{t+1} + \rho\epsilon^\delta_{t+1} = k - \rho a_\delta + a_d - a_r$$
Or, which is the same, estimate the three equation model after substituting the first
equation with
$$r_{t+1} = k + a_d - \rho a_\delta + (1 - \rho\phi + b_d)\delta_t + \epsilon^d_{t+1} - \rho\epsilon^\delta_{t+1}$$
Second, check the condition
$$0 = 1 - \rho\phi + b_d - b_r$$
using the estimated values of the parameters. Remember that this second condition
comes entirely from the idea that the ϵ's should not depend on δt.
However, we still have a problem: when we estimate the three equation (constrained)
model we use “true” log total returns: rt+1. The constrained first equation
$$r_{t+1} = k + a_d - \rho a_\delta + (1 - \rho\phi + b_d)\delta_t + \epsilon^d_{t+1} - \rho\epsilon^\delta_{t+1}$$
only truly applies to the approximate returns. As we saw above, true log total
returns should be used not in this equation but in the nonlinear equation
$$r_{t+1} = a_d - a_\delta + (1 - \phi + b_d)\delta_t + \ln(1 + e^{a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}}) + \epsilon^d_{t+1} - \epsilon^\delta_{t+1}$$
For this reason we must conclude that the only possible “test” for the approximation
and the three equation model comes from estimating the nonlinear three equation
model, computing ρ and k according to their definitions, substituting all parameter and
epsilon values so estimated in the approximate equation for rt+1
$$r_{t+1} = k + a_d - \rho a_\delta + (1 - \rho\phi + b_d)\delta_t + \epsilon^d_{t+1} - \rho\epsilon^\delta_{t+1}$$
and seeing if the fitted values obtained by this equation are similar to the true rt+1 series.
This may seem complex, as the nonlinear first equation is difficult to estimate, but
we must understand that, since the constrained first equation does not contain any
parameter or residual that does not come from the other two equations, we really do
not need to estimate it (in technical terms it is not a constrained but a redundant
equation), so that the procedure is very simple. Estimate the two remaining equations,
compute k and ρ by expanding the term $\ln(1 + e^{a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}})$ (typically at the point
$\delta = \bar\delta_t$ or similar) and proceed as previously advised. An even simpler procedure starts from
the same estimates, computes
$$r_{t+1} = k + a_d - \rho a_\delta + (1 - \rho\phi + b_d)\delta_t + \epsilon^d_{t+1} - \rho\epsilon^\delta_{t+1}$$
leaving k and ρ unspecified, and then regresses $r_{t+1} - a_d - (1 + b_d)\delta_t - \epsilon^d_{t+1}$ on a constant
and $a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}$, looking at the R square of this regression, which should be very
high (in principle equal to 1) in order to justify the Taylor approximation in the context
of Cochrane’s model.
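A sketch of this last check on simulated data: build exact log total returns from the nonlinear identity, then regress $r_{t+1} - a_d - (1+b_d)\delta_t - \epsilon^d_{t+1}$ on a constant and $\delta_{t+1} = a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}$. If the linearization is good the fit should be nearly perfect, with a slope close to $-\rho$ (all parameter values below are illustrative, not estimates):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
a_d, b_d = 0.02, 0.0
a_delta, phi = -0.17, 0.95            # implies E(delta) = -3.4, i.e. D/P around 3.3%
delta = np.empty(T + 1)
delta[0] = a_delta / (1 - phi)
eps_delta = 0.1 * rng.standard_normal(T)
for t in range(T):
    delta[t + 1] = a_delta + phi * delta[t] + eps_delta[t]
eps_d = 0.1 * rng.standard_normal(T)
dd = a_d + b_d * delta[:-1] + eps_d
# Exact (non linearized) log total return from the identity.
r = dd + np.log(1 + np.exp(delta[1:])) - delta[1:] + delta[:-1]

# Regress r_{t+1} - a_d - (1+b_d) delta_t - eps^d_{t+1} on a constant and delta_{t+1}.
y = r - a_d - (1 + b_d) * delta[:-1] - eps_d
X = np.column_stack([np.ones(T), delta[1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
rho = 1 / (1 + np.exp(delta.mean()))  # the usual rho evaluated at the mean of delta
print(round(beta[1], 3), round(-rho, 3), round(r2, 6))
```

With the log dividend price ratio fluctuating in a realistic range the regression R square is essentially 1 and the slope estimates $-\rho$, which is what justifies the Taylor approximation here.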
Cochrane’s model is intended as a simple study of the forecastability of log total returns.
In a naive use of the model this is a question whose answer lies in the estimation of br.
In Cochrane’s approximate, constrained version it has to do with the value of
1 − ρφ + bd.
In the full version of the model the forecasting properties are more interesting: the
ability of the log dividend price ratio to forecast returns has a linear component
which depends on the value of 1 − φ + bd and a nonlinear part which depends on
$\ln(1 + e^{a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}})$. The difference between the true and the approximate parameter
for δt, that is: ρφ − φ, comes entirely from the linear approximation of $\ln(1 + e^{a_\delta + \phi\delta_t})$.
Notice the implicit hypothesis of expanding δt+1 around its expected (conditional
on δt) value aδ + φδt. Moreover, notice that ρ shall be a function of φ.
For these reasons, supposing φ positive and smaller than 1 and bd positive, while
in the approximate model the effect of an increase of δt on the expected
value of rt+1 is positive and the same for any level of δt, in the full model this
effect shall be smaller the smaller is δt and bigger the bigger is δt. For positive values
of φ we shall have an effect that goes, roughly, from 1 − φ + bd when dividends are near
zero to 1 − φ/2 + bd in the limit case where price and dividends are the same. For this
reason in the approximate version ρ is expected to be smaller than, but near, 1.
20.14.4 *A more detailed look at Cochrane’s model: iterating over many time periods
In section 2.2 of his paper Cochrane computes “long run” coefficients. Actually there are
two kinds of long run coefficients; Cochrane only considers one, but here we shall deal
with both. Let us do it step by step.
We begin by rephrasing Cochrane’s argument (we hope this version shall be clearer
than the original).
The starting point is the iteration of the linearized log total return (here we continue
to use r for the log total return)
$$\delta_t = \sum_{j=0}^{\infty} \rho^j (r_{t+j+1} - \Delta d_{t+j+1}) - \frac{k}{1-\rho} = \sum_{j=0}^{\infty} \rho^j r_{t+j+1} - \sum_{j=0}^{\infty} \rho^j \Delta d_{t+j+1} - \frac{k}{1-\rho}$$
Now multiply on the left and on the right by δt − E(δt) and take the expected value
(and remember that the expected value of the difference from the expected value is 0):
$$V(\delta_t) = Cov\Big(\sum_{j=0}^{\infty} \rho^j r_{t+j+1};\, \delta_t\Big) - Cov\Big(\sum_{j=0}^{\infty} \rho^j \Delta d_{t+j+1};\, \delta_t\Big) = \sum_{j=0}^{\infty} \rho^j Cov(r_{t+j+1}; \delta_t) - \sum_{j=0}^{\infty} \rho^j Cov(\Delta d_{t+j+1}; \delta_t)$$
where the second equality simply depends on the exchangeability of the expected value
operator and the sum operator (really this is not a sum, it is a series, and we are
supposing something more than this, but it is not the case here to be too picky).
This is obviously an identity if the required moments exist and if we forget the
approximation error. An identity is always true so, please, do not try to give it any
economic interpretation.
If we divide everything by V(δt) we get
$$1 = \beta\Big(\sum_{j=0}^{\infty} \rho^j r_{t+j+1};\, \delta_t\Big) - \beta\Big(\sum_{j=0}^{\infty} \rho^j \Delta d_{t+j+1};\, \delta_t\Big) = \sum_{j=0}^{\infty} \rho^j \beta(r_{t+j+1}; \delta_t) - \sum_{j=0}^{\infty} \rho^j \beta(\Delta d_{t+j+1}; \delta_t)$$
where β(a; b) = cov(a; b)/V(b) is the univariate regression coefficient of a on b.
It is interesting to notice that this equation tells us (but this is again a tautology,
if everything converges in the right sense) that the regression coefficient we get by
regressing the sum of future (discounted with ρ) total log returns on the current log
dividend price ratio is identical to the sum of the (discounted) regression coefficients
of each future return on the current log dividend price ratio.
If we now apply the three equation Cochrane model, we are easily able to compute
these betas as, by direct substitution, we have
$$\beta(r_{t+j+1}; \delta_t) = b_r\phi^j \qquad \beta(\Delta d_{t+j+1}; \delta_t) = b_d\phi^j$$
It is then easy to compute
$$\sum_{j=0}^{\infty} \rho^j \beta(r_{t+j+1}; \delta_t) = \sum_{j=0}^{\infty} \rho^j b_r \phi^j = \frac{b_r}{1 - \rho\phi}$$
and a similar formula holds for the other parameter, which shall be less relevant because we
know that the estimate of bd is very near 0.
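As a sanity check, the geometric series above can be summed numerically (the values of br, φ and ρ below are round illustrative numbers close to the yearly estimates, not the actual estimates):

```python
# Long run coefficient: sum of discounted one-period betas b_r * (rho*phi)^j.
b_r, phi, rho = 0.092, 0.946, 0.961   # illustrative values, not estimates
partial = sum(b_r * (rho * phi) ** j for j in range(10_000))
closed_form = b_r / (1 - rho * phi)
print(round(partial, 6), round(closed_form, 6))  # the two values coincide
```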
This is what Cochrane calls the long run coefficient. The reason for this name is that
it is the value (given the model and the approximation) of the regression coefficient of
the sum of future (discounted) returns on the current log dividend price ratio. By
repeating the arguments in the previous section we should check the quality of the
model and the approximation by verifying whether the constraints implied by the
approximation actually hold but, since no direct estimate of the long run coefficient is
possible, this would tell us nothing of relevance, except that an error in respecting the
constraint would be amplified by the division by the term 1 − ρφ, which is going to be
quite small.
There is a second “long run coefficient” of interest for us, which is more connected
than the former with long run forecastability and which is going to add an interpretation
to Cochrane’s results.
Let us start from the approximated model (the constrained first equation)
$$r_{t+1} = k + a_d - \rho a_\delta + (1 - \rho\phi + b_d)\delta_t + \epsilon^d_{t+1} - \rho\epsilon^\delta_{t+1}$$
and suppose we want to forecast the log total return (with reinvested dividends) over
n + 1 time periods:
$$\sum_{j=0}^{n} r_{t+j+1} = (n+1)(k + a_d - \rho a_\delta) + (1 - \rho\phi + b_d)\sum_{j=0}^{n} \delta_{t+j} + \sum_{j=0}^{n} (\epsilon^d_{t+j+1} - \rho\epsilon^\delta_{t+j+1})$$
which, according to the simple model for δt, is equal to
$$\sum_{j=0}^{n} r_{t+j+1} = (n+1)(k + a_d - \rho a_\delta) + (1 - \rho\phi + b_d)\sum_{j=0}^{n} \phi^j \delta_t + \sum_{j=0}^{n} \phi^j \epsilon^\delta_{t+j+1} + \sum_{j=0}^{n} (\epsilon^d_{t+j+1} - \rho\epsilon^\delta_{t+j+1})$$
As before, multiply on both sides by δt − E(δt) and divide by the variance of δt. We
get
$$\beta\Big(\sum_{j=0}^{n} r_{t+j+1};\, \delta_t\Big) = (1 - \rho\phi + b_d)\sum_{j=0}^{n} \phi^j = (1 - \rho\phi + b_d)\frac{1 - \phi^{n+1}}{1 - \phi} \xrightarrow[n\to\infty]{} (1 - \rho\phi + b_d)\frac{1}{1 - \phi}$$
(the limit holds if the absolute value of φ is less than 1).
If we wish to compute the R2 of the regression we need the variance of the total
return from t to t + n + 1 conditional on t − n − 1. We have, supposing stationarity and
no correlation across the ϵ's,
$$V\Big(\sum_{j=0}^{n} r_{t+j+1} \,\Big|\, t-n-1\Big) = (1 - \rho\phi + b_d)^2\, V(\delta_t | t-n-1)\Big(\sum_{j=0}^{n} \phi^j\Big)^2 + \sum_{j=0}^{n} (\phi^j - \rho)^2 V(\epsilon^\delta) + (n+1)V(\epsilon^d)$$
Recall that, if we regress the total return over n + 1 time periods on the log dividend
price ratio at the beginning of each period of length n + 1, we just use one log dividend
price ratio each n + 1 data points. So the relevant variance of the log dividend price ratio for
computing the R2 of the regression is the variance of this “interval sampled” series
and, given the simple autoregressive model for the log dividend price ratio,
this variance shall be bigger than the variance of the same series sampled at each data
point. In fact
$$V(\delta_t | t-n-1) = V\Big(\delta_{t-n-1}\phi^{n+1} + \sum_{j=0}^{n} \phi^j \epsilon^\delta_{t-j}\Big) = V(\epsilon^\delta)\sum_{j=0}^{n} \phi^j$$
Some comments.
If the model plus the Campbell–Shiller approximation work, we should have (see
above) (1 − ρφ + bd) ≈ br and, for large n, we should then have (supposing, obviously, φ
between 0 and strictly less than 1)
$$\beta\Big(\sum_{j=0}^{n} r_{t+j+1};\, \delta_t\Big) \approx \frac{b_r}{1 - \phi}$$
That is, exactly what we find in the Cochrane analysis, but without the ρ.
This result states that the longer the time period over which we compute a return,
the bigger is going to be the regression coefficient of this return on the log dividend
price ratio. This confirms our “back of the envelope” analysis made under the hypothesis
that log dividend price ratios be perfectly autocorrelated, but it adds an important
insight: due to the non perfect autocorrelation (φ is less than 1), the
growth of the regression coefficient shall not be linear and unbounded but less than
linear and bounded. With our simple OLS monthly data estimates we get $.0041(1 -
.996^{13})/(1 - .996) = .052$ which, if we consider the size of the standard errors involved
(and the rough treatment of monthly dividends), is not so far from the .092 estimated
with yearly data. With these estimates the limiting value for β shall be equal to
$.0041/(1 - .996) = 1.025$.
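These two numbers are easy to reproduce from the monthly estimates (br = .0041, φ = .996):

```python
# Implied multi-period regression coefficient: b_r * (1 - phi^(n+1)) / (1 - phi).
b_r, phi = 0.0041, 0.996

def long_run_beta(n):
    return b_r * (1 - phi ** (n + 1)) / (1 - phi)

print(round(long_run_beta(12), 3))  # one year of monthly periods -> 0.052
print(round(b_r / (1 - phi), 3))    # limiting value as n grows    -> 1.025
```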
From the above formula for the variance of cumulated log total returns we have
$$1 - R^2 = \frac{\sum_{j=0}^{n} (\phi^j - \rho)^2 V(\epsilon^\delta) + (n+1)V(\epsilon^d)}{(1 - \rho\phi + b_d)^2 V(\epsilon^\delta)\big(\sum_{j=0}^{n} \phi^j\big)^3 + \sum_{j=0}^{n} (\phi^j - \rho)^2 V(\epsilon^\delta) + (n+1)V(\epsilon^d)}$$
Since everything is bounded, except the n in the numerator and in the denominator,
which are both multiplied by the same constant, this goes to 1 for n going to infinity,
except in the case φ = 1. In this case the formula simplifies to
$$1 - R^2 = \frac{(n+1)(1-\rho)^2 V(\epsilon^\delta) + (n+1)V(\epsilon^d)}{(1 - \rho\phi + b_d)^2 V(\epsilon^\delta)(n+1)^3 + (n+1)(1-\rho)^2 V(\epsilon^\delta) + (n+1)V(\epsilon^d)}$$
and clearly this goes to 0 as 1/n² as n goes to infinity.
In intermediate cases (from our data we see that φ is very near to 1) we have that,
starting from the value
$$1 - R^2 = \frac{(1-\rho)^2 V(\epsilon^\delta) + V(\epsilon^d)}{(1 - \rho\phi + b_d)^2 V(\epsilon^\delta) + (1-\rho)^2 V(\epsilon^\delta) + V(\epsilon^d)}$$
the value of 1 − R² shall first decrease and then increase, with a limit of 1.
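This non-monotone behavior is easy to see numerically. The function below implements the 1 − R² formula above; the parameter values are arbitrary illustrations chosen to make the dip visible, not estimates:

```python
# 1 - R^2 as a function of the horizon n, from the formula in the text.
def one_minus_r2(n, phi=0.99, rho=0.96, b_d=0.0, v_delta=1.0, v_d=10.0):
    s1 = sum(phi ** j for j in range(n + 1))                 # sum of phi^j
    s2 = sum((phi ** j - rho) ** 2 for j in range(n + 1))    # sum of (phi^j - rho)^2
    num = s2 * v_delta + (n + 1) * v_d
    den = (1 - rho * phi + b_d) ** 2 * v_delta * s1 ** 3 + num
    return num / den

vals = {n: one_minus_r2(n) for n in (0, 50, 10_000)}
print(vals)  # near 1 at n = 0, dips at intermediate n, then climbs back toward 1
```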
Summing up, if the approximation and the simple model work, we have that log
dividend price ratios shall have increasing forecast power on log total returns as these
returns are computed over longer time periods, up to some time horizon depending on
the parameters of the model. Notice that the smaller the variance of ϵd (which,
according to our empirical results, shall be almost the same as the variance of Δdt),
the longer the time interval over which the forecasting power of the regression shall
increase. In the limit, if this variance is 0, that is, if the dividend flow is constant,
1 − R² decreases to a lower bound, which is smaller the smaller is V(ϵδ). This is not
surprising. In fact, if we start from the approximation
$$r_{t+1} = k - \rho\delta_{t+1} + \delta_t + \Delta d_{t+1}$$
suppose Δd constant (to be simple, equal to 0, but nothing relevant changes) and
use the equation for the evolution of δt, we get
$$r_{t+1} = k - \rho(a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}) + \delta_t = k - \rho a_\delta + (1 - \rho\phi)\delta_t - \rho\epsilon^\delta_{t+1}$$
and the better you can forecast δ, that is, the smaller V(ϵδ), the better the forecast,
no disturbance being induced by random changes in dividends. More simply: if dividends
are constant, forecasting log dividend price ratios is equivalent to forecasting minus log
prices.
21 *Appendix: Some further info about the use of regression models
The following discussion is a simplified version of some sections of Haavelmo’s (1943)
paper (Trygve Haavelmo: “The Statistical Implications of a System of Simultaneous
Equations”, Econometrica, 11(1), Jan. 1943, pp. 1–12).
In Economics, as a rule, many variables interact in order to maintain a system in
equilibrium, a typical setting is the supply/demand interpretation of market equilibrium.
let us begin by considering a consumer who confronts a given price P for a good.
For any given fixed price P we could assume that the quantity Q that the consumer decides to buy is given by, say, Q = α + βP + e1, where we may suppose e1 to have expected value 0. We may understand this as our model for the answers of the consumer to a set of questions about the quantity the consumer would be willing to buy at a given unit price, where the unit price is not necessarily a market price but a price set by us, who ask the question. In this setting (where P is not stochastic) it is easy to see that, for any given P, we have E(Q|P) = α + βP.
On the other hand, let us see the thing from the point of view of the "seller" (supply). Suppose the seller is asked for a GIVEN quantity Q: we may suppose the seller shall require a price equal to P = γ + δQ + e2. Again let us suppose E(e2) = 0. In this case we may say that E(P|Q) = γ + δQ.
Now, both these results are exact under the hypothesis that, in the first, P is given and, in the second, Q is given. In a sense we are implicitly considering two experiments: in the first the consumer is confronted with different given levels of price and asked for the desired quantity at each price; in the second the seller is confronted with given quantities and asked for the price. The two random elements (e1, e2) take into account the fact that the answers for the same fixed P or Q shall not always be the same. This could be, and we suppose it to be, a good description of the specific experimental setting.
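As a sketch of this "experimental" setting (with hypothetical parameter values), note that when P is set by the experimenter, the OLS regression of Q on P recovers the demand slope β:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical demand schedule: Q = alpha + beta*P + e1
alpha, beta = 10.0, -0.8
n = 100_000

# "Experiment": the prices are chosen by us, hence exogenous
P = rng.uniform(1.0, 5.0, n)
Q = alpha + beta * P + rng.normal(0.0, 1.0, n)

# OLS slope of Q on P recovers beta, since E(Q|P) = alpha + beta*P
b_hat = np.cov(Q, P)[0, 1] / np.var(P, ddof=1)
print(b_hat)
```

With P exogenously set, b_hat is close to β; the point of the rest of the section is that this stops being true for market (equilibrium) data.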
Suppose now we let the market work: we do not set prices and ask for quantities or vice versa, we just observe prices and quantities as set by the market. Let us suppose that, notwithstanding the fact that we are no longer asking questions but are now in an observational setting, equilibrium requires Q = α + βP + e1 and P = γ + δQ + e2 to hold simultaneously. The algebraic meaning of this is that the Q and the P in the supply and in the demand function must have the same value (this was absolutely NOT required when the consumer and the seller answered our questions).
Moreover, let us suppose that both consumer and supplier do not alter their demand and supply functions due to the fact that they are now in a true market situation and not simply answering questions. This, as stated in the text, is a very strong hypothesis.
Under these conditions we have a system of two equations in four unknowns and we can express the equilibrium values of both P and Q as functions of e1, e2. We have P = (γ + δα + e2 + δe1)/(1 − βδ) and Q = (α + βγ + βe2 + e1)/(1 − βδ). Both P and Q depend (in different ways) on the same e1 and e2.
Now: what shall, say, E(P|Q) be in this setting?
Suppose, for simplicity, that e1 and e2 are jointly Gaussian with expected values both 0, covariance 0 and variances V(e1) and V(e2).
In this case, with some easy computations, we find that:

E(P|Q) = (γ + δα)/(1 − βδ) − B_{P|Q} (α + βγ)/(1 − βδ) + B_{P|Q} Q

where B_{P|Q} = Cov(P, Q)/V(Q), that is:

B_{P|Q} = (βV(e2) + δV(e1)) / (β²V(e2) + V(e1))
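A small simulation (with hypothetical parameter values) makes the point concrete: generating equilibrium pairs from the two simultaneous equations and regressing P on Q yields the slope Cov(P, Q)/V(Q), not δ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters of the two schedules
alpha, beta = 10.0, -0.8      # demand:  Q = alpha + beta*P + e1
gamma, delta = 1.0, 0.5       # supply:  P = gamma + delta*Q + e2
v1, v2 = 1.0, 2.0             # variances of e1, e2
n = 500_000

e1 = rng.normal(0.0, np.sqrt(v1), n)
e2 = rng.normal(0.0, np.sqrt(v2), n)

# Equilibrium values solving the two equations simultaneously
P = (gamma + delta * alpha + e2 + delta * e1) / (1 - beta * delta)
Q = (alpha + beta * gamma + beta * e2 + e1) / (1 - beta * delta)

# OLS slope of P on Q in the observational (equilibrium) data...
b_obs = np.cov(P, Q)[0, 1] / np.var(Q, ddof=1)
# ...against the slope implied by the equilibrium solutions above
b_theory = (beta * v2 + delta * v1) / (beta**2 * v2 + v1)

print(b_obs, b_theory, delta)
```

Here b_obs matches b_theory and is far from δ (it even has the opposite sign for these parameter values): the equilibrium regression of P on Q does not recover the supply schedule.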
This is NOT γ + δQ (except in particular cases). In fact, it would be better to use different symbols for the conditional expectation describing the answers of the seller to our questions and for this conditional expectation, which has to do with the information we can get on the EQUILIBRIUM P given the EQUILIBRIUM Q (we are going to do this in a moment).
In other words: even if the underlying hypotheses on demand and supply do not change, if we pass from the "experimental" setting to the "observational" setting (and, obviously, vice versa), the relevant regression functions are different.
If I want to forecast P when I confront the seller with a GIVEN Q (say: "experiment"), and let us indicate this with Q*, the forecast shall be E(P|Q*) = γ + δQ*.
However, if, confronted with market data, I want to forecast the price P for a quantity Q in equilibrium ("observational"), I shall use E(P|Q) = (γ + δα)/(1 − βδ) − B_{P|Q} (α + βγ)/(1 − βδ) + B_{P|Q} Q.
What is happening is this: when I confront, say, the seller and ask for the unit price the seller would require for a given quantity, I am not requiring that the resulting price-quantity pair be acceptable to the buyer. I am simply studying the "supply", that is: what would be the price asked for a given fixed quantity. This is NOT what we get from observations of price-quantity pairs in the market: there, only those pairs which are in equilibrium are observable. If I observe a Q I am not "choosing" it; I observe it because it is an equilibrium Q, and to this shall correspond an equilibrium P. The same Q, if chosen by me or if accepted as an equilibrium value, has different meanings and yields different information on the corresponding P, even if we suppose that nothing changes in the parametrization of demand and supply.
This is basic Economics and is a VERY SIMPLE example: in more realistic situations, many more variables are considered in the model. It should nevertheless be useful to understand the kind of analysis which is necessary in these cases. Notice, again, that here we did suppose that, in a sense, being in an experimental setting or in an observational setting "did not change" the "form of the model". As noted above, this is by no means the usual case.
An aside: B_{P|Q} is undefined if both e1 and e2 have zero variance. This is because in this case the equilibrium is just one point: a specific P and a specific Q. To be precise: P = (γ + δα)/(1 − βδ) and Q = (α + βγ)/(1 − βδ) (we suppose the four parameters to be such that the solution exists and is unique, in particular βδ ≠ 1).
Which regression function should we use? If what we wish for is a forecast of what we can expect P to be, given the Q observed in the market, we should use the equilibrium regression function.
Suppose, however, that our purpose is to compel sellers to produce a given quantity Q*; in order to assess which price we should pay for this out-of-equilibrium Q*, we should use E(P|Q*) = γ + δQ*. It may be interesting to notice that, if this is our purpose, and sellers know this, they could be induced to answer our questions about the price for a given quantity in a way that does not reflect what they would do when confronted
with market equilibrium. They could be induced to cheat and set, for instance, a bigger γ just because this would imply a bigger price for any Q*. In other words: when "experiments" are run on "strategic" individuals, these could be non-informative about the individuals' behaviour outside the experiment.
This is exactly what we supposed NOT to happen when we assumed the demand
and supply functions not to change when elicited by our questions and when considered
in the market equilibrium.
If this invariance does not hold, observational data on equilibrium prices and quantities can be used for forecasts but not for policy evaluation. On the other hand, direct "experimental" data on supply and demand can be used for intervention evaluation but not for forecasting purposes. This problem, which is very likely to arise if we assume strategic agents, could be solved if we could model the way in which the behaviour of agents changes according to the different setups (un-tampered equilibrium, intervention, experimental observation).
The debate about these fundamental points about the use of Econometrics has been central in the Econometrics literature in the last 80 years (and, as such, it has contributed a lot to making the field interesting). Here are some quick references.
Haavelmo (1943) explicitly contains these considerations.
A special case was considered by Robert Lucas (1976)73.
The specific case considered by Lucas is that of the different effects on agents' expectations of data coming from "natural" market action and of data altered by policy actions. Lucas's analysis is usually discussed under the term "Lucas critique". Even if a given economic model correctly describes the joint probability distributions of the observables in an economic system when intervention is not active, any policy action could essentially change the system due to the (assumed optimizing) reaction of the economic agents to the intervention. The basic idea of Lucas, echoing Haavelmo and, more simply, common sense, is that we should explicitly model the reaction of agents to policy actions and how this alters the overall equilibrium behaviour. If this is not done, and such reaction alters equilibrium behaviour, the usefulness of historical data for evaluating policy actions is in doubt, and forecasts of the "effects" of policy actions could be way off the mark.
The analysis also implies, obviously, that, conversely, data observed when a policy was in action would possibly be of doubtful use in estimating forecast models to be used when the policy is not acting.
Clearly, modeling the reaction of agents to policy actions may be feasible at an individual-agent micro level. At a macro level, that is, at the level of the full economic system, this is very difficult (and this is a euphemism) and possible only conditional on drastic simplifying hypotheses (e.g. the "representative agent") which may degrade the realism of the model.
73 Lucas, Robert (1976). "Econometric Policy Evaluation: A Critique". In Brunner, K.; Meltzer, A. (eds.), The Phillips Curve and Labor Markets. Carnegie-Rochester Conference Series on Public Policy, 1. New York: American Elsevier, pp. 19–46.
Chris Sims (1980)74, in one of the most quoted papers in the history of Econometrics, strongly makes the point that, due to the problems and potentially arbitrary choices involved in the attempt at a causal interpretation of econometric models, more care should be given to building models whose first purpose is forecasting. He also argues that such models, under reasonable assumptions, could be used for at least approximate policy evaluations.
Sims elaborates on previous research, in particular on a paper by Ta-Chung Liu (1960)75 and on the strain of research induced by Haavelmo's paper.
In recent years, many econometricians, in particular micro econometricians (see above for the comment on the Lucas critique), have based their interpretation of the "causal" use of (micro) econometric models on the approach summarized, for instance, by the work of Judea Pearl76. This, in its turn, is based on the approach to causal inference developed by Donald Rubin and his co-authors since the '70s77.
It is to be noticed that the Pearl and Rubin approach explicitly requires, in a very strict way, the kind of "structural stability" with respect to intervention briefly discussed above and, as a consequence, while maybe useful at a micro level, it is probably not the best way to deal with macroeconometric policy analysis.
74 Sims, Christopher (January 1980). "Macroeconomics and Reality". Econometrica, 48 (1), pp. 1–48.
75 Ta-Chung Liu, "Underidentification, Structural Estimation, and Forecasting", Econometrica, Vol. 28, No. 4 (Oct. 1960), pp. 855–865.
76 A good summary in: Judea Pearl, Madelyn Glymour and Nicholas Jewell, "Causal Inference in Statistics: A Primer", Wiley, 2016.
77 See, for a good summary: Donald B. Rubin and Guido W. Imbens (2015), "Causal Inference for Statistics, Social, and Biomedical Sciences", Cambridge University Press.
Contents

1 Returns
1.1 Return definitions
1.2 Price and return data
1.3 Some empirical "facts"

2 Logarithmic random walk
2.1 "Stocks for the long run" and time diversification
2.2 *Some further consideration about log and linear returns

3 Volatility estimation
3.1 Is it easier to estimate µ or σ²?

4 Non Gaussian returns

5 Four different ways for computing the VaR
5.1 Gaussian VaR
5.2 Non parametric VaR
5.3 Semi parametric VaR
5.4 Mixture of Gaussians

6 Matrix algebra

7 Matrix algebra and Statistics
7.1 Risk budgeting
7.2 A varcov matrix is at least psd
7.3 Note

8 The deFinetti, Markowitz and Roy model for asset allocation

9 Linear regression
9.1 Weak OLS hypotheses
9.2 The OLS estimate
9.3 The Gauss Markoff theorem
9.4 Fit and errors of fit
9.5 R²
9.6 More properties of Ŷ and ε̂
9.7 Strong OLS hypotheses and testing linear hypotheses in the linear model
9.8 "Forecasts"
9.9 A note on P-values
9.10 Stochastic X
9.11 Markowitz and the linear model (this section is not required for the exam)
9.12 Some results useful for the interpretation of estimated coefficients

10 Style analysis
10.1 Traditional approaches with some connection to style analysis
10.2 Critiques to style analysis

11 Factor models and principal components
11.1 A very short introduction to linear asset pricing models
11.2 Estimates for B and F
11.3 Maximum variance factors
11.4 Bad covariance and good components?

12 Black and Litterman

13 Appendix: Probabilities as prices for betting
13.1 Betting systems
13.2 Probability and frequency
13.3 Probability and behaviour

14 Appendix: Numbers and Maths in Economics

15 Appendix: Optimal Portfolio Theory, who invented it?

16 Appendix: Gauss Markoff theorem

17 Exercises and past exams

18 Appendix: Some matrix algebra
18.1 Definition of matrix
18.2 Matrix operations
18.3 Rank of a matrix
18.4 Some special matrix
18.5 Determinants and Inverse
18.6 Quadratic forms
18.7 Random Vectors and Matrices (see the following appendix for more details)
18.8 Functions of Random Vectors (or Matrices)
18.9 Expected Values of Random Vectors
18.10 Variance Covariance Matrix
18.11 Correlation Coefficient
18.12 Derivatives of linear functions and quadratic forms
18.13 Minimization of a PD quadratic form, approximate solution of over determined linear systems
18.14 Minimization of a PD quadratic form under constraints. Simple applications to Finance
18.15 The linear model in matrix notation

19 Appendix: What you cannot ignore about Probability and Statistics
19.1 Probability: a Language
19.2 Interpretations of Probability
19.3 Probability and Randomness
19.4 Different Fields: Physics
19.5 Finance
19.6 Other fields
19.7 Wrong Models
19.8 Meaning of Correct
19.9 Events and Sets
19.10 Classes of Events
19.11 Probability as a Set Function
19.12 Basic Results
19.13 Conditional Probability
19.14 Bayes Theorem
19.15 Stochastic Independence
19.16 Random Variables
19.17 Properties of the PDF
19.18 Density and Probability Function
19.19 Density
19.20 Probability Function
19.21 Expected Value
19.22 Expected Value
19.23 Variance
19.24 Tchebicev Inequality
19.25 *Vysochanskij–Petunin Inequality
19.26 *Gauss Inequality
19.27 *Cantelli One Sided Inequality
19.28 Quantiles
19.29 Median
19.30 Subsection
19.31 Univariate Distributions Models
19.32 Some Univariate Discrete Distributions
19.33 Some Univariate Continuous Distributions
19.34 Some Univariate Continuous Distributions
19.35 Some Univariate Continuous Distributions
19.36 Random Vector
19.37 Distribution Function for a Random Vector
19.38 Density and Probability Function
19.39 Marginal Distributions
19.40 Conditioning
19.41 Stochastic Independence
19.42 Mutual Independence
19.43 Conditional Expectation
19.44 Conditional Expectation
19.45 Conditional Expectation
19.46 Law of Iterated Expectations
19.47 Regressive Dependence
19.48 Covariance and Correlation
19.49 Distribution of the max and the min for independent random variables
19.50 Distribution of the max and the min for independent random variables
19.51 Distribution of the sum of independent random variables and central limit theorem
19.52 Distribution of the sum of independent random variables and central limit theorem
19.53 Distribution of the sum of independent random variables and central limit theorem
19.54 Why Statistics
19.55 Unknown Probabilities and Symmetry
19.56 Unknown Probabilities and Symmetry
19.57 No Symmetry
19.58 No Symmetry
19.59 Learning Probabilities
19.60 Pyramidal Die
19.61 Pyramidal Die Model
19.62 Pyramidal Die Constraints
19.63 Many Rolls
19.64 Probability of Observing a Sample
19.65 Pre or Post Observation?
19.66 Maximize the Probability of the Observed Sample
19.67 Maximum Likelihood
19.68 Sampling Variability
19.69 Possibly Different Samples
19.70 The Probability of Our Sample
19.71 The Probability of a Similar Estimate
19.72 The Probability of a Similar Estimate
19.73 The Probability of a Similar Estimate
19.74 The Estimate in Other Possible Samples
19.75 The Estimate in Other Possible Samples
19.76 Sampling Variability
19.77 Sampling Variability
19.78 Sampling Variability
19.79 Sampling Variability
19.80 Estimated Sampling Variability
19.81 Quantifying Sampling Variability
19.82 Principle 1
19.83 Principle 2
19.84 Principle 3
19.85 The Questions of Statistics
19.86 Statistical Model
19.87 Specification of a Parametric Model
19.88 Statistic
19.89 Parametric Inference
19.90 Different Inferential Tools
19.91 Point Estimation
19.92 Unbiasedness
19.93 Mean Square Error
19.94 Mean Square Efficiency
19.95 Meaning of Efficiency
19.96 Mean Square Consistency
19.97 Mean Square Consistency
19.98 Methods for Building Estimates
19.99 Method of Moments
19.100 Estimation of Moments
19.101 Inverting the Moment Equation
19.102 Problems
19.103 Maximum Likelihood
19.104 Maximum Likelihood
19.105 Interpretation
19.106 Interpretation
19.107 Interpretation
19.108 Interpretation
19.109 Maximum Likelihood for Densities
19.110 Example (Discrete Case)
19.111 Example Method of Moments
19.112 Example Maximum likelihood
19.113 More Advanced Topics
19.114 Sampling Standard Deviation and Confidence Intervals
19.115 Sampling Variance of the Mean
19.116 Estimation of the Sampling Variance
19.117 nσ Rules
19.118 Confidence Intervals
19.119 Confidence Intervals
19.120 Confidence Intervals
19.121 Hypothesis testing
19.122 Parametric Hypothesis
19.123 Two Hypotheses
19.124 Simple and Composite
19.125 Example
19.126 Critical Region, Acceptance Region
19.127 Errors of First and Second Kind
19.128 Power Function and Size of the Errors
19.129 Testing Strategy
19.130 Asymmetry
19.131 Some Tests
19.132 Some Tests
19.133 Some Tests
19.134 Some Tests
19.135 Some Tests
19.136 Some Tests

20 *Taylor formula in finance (not for the exam)
20.1 *Taylor's theorem
20.2 *Remainder term
20.3 *Proof
20.4 *Taylor formula and Taylor series
20.5 *Taylor formula for functions of several variables
20.6 *Simple examples of Taylor formula and Taylor theorem in quantitative Economics and Finance
20.7 *Linear and log returns, a reconsideration
20.8 *Taylor theorem and the connection between linear and log returns
20.9 *How big is the error?
20.10 *Gordon model and Campbell-Shiller approximation
20.11 *Remainder term
20.12 *Dividend price model
20.13 *What happens if we take the remainder seriously
20.14 *Cochrane toy model

21 *Appendix: Some further info about the use of regression models