Handouts for the course "Financial Econometrics and Empirical Finance Module I"

Francesco Corielli

August 26, 2020

It is perhaps not to be wondered at, since fortune is ever changing her course and time is infinite, that the same incidents should occur many times, spontaneously. For, if the multitude of elements is unlimited, fortune has in the abundance of her material an ample provider of coincidences; and if, on the other hand, there is a limited number of elements from which events are interwoven, the same things must happen many times, being brought to pass by the same agencies.
Plutarch, Parallel Lives, Life of Sertorius.

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.
Ronald Fisher

It is true that M. Fourier had the opinion that the principal aim of mathematics was public utility and the explanation of natural phenomena; but a philosopher like him should have known that the sole aim of science is the honour of the human spirit, and that, on this ground, a question about numbers is worth as much as a question about the system of the world.
Carl Gustav Jacobi (letter to Adrien-Marie Legendre, from Konigsberg, July 2nd, 1830)

Among the first examples of least squares: Roger Cotes with Robert Smith, ed., "Harmonia mensurarum" (Cambridge, England: 1722), chapter "Aestimatio errorum in mixta mathesis per variationes partium trianguli plani et sphaerici", pag. 22. A hypothesis should be made explicit: no systematic bias in the measuring instruments. Thomas Simpson points this out in "An attempt to shew the advantage arising by taking the mean of a number of observations in practical astronomy" (from "Miscellaneous Tracts on Some Curious Subjects ...", London, 1757). In modern terms this shall become E(ε|X) = 0 (see section 9).
Introduction

This course aims to offer students a selection of probabilistic and statistical applications compiled according to a twofold criterion: they should require the introduction of only a few new statistical tools and make use, as much as possible, of what the student should already know; and they should be, as far as possible given the introductory level of the course, "real world" tools, that is: they should represent simplified versions of tools really, and sometimes heavily, applied in the markets. The course also aims at a much more difficult task: trying to show how probabilistic and statistical thinking can actually be useful for understanding and surviving markets. Historical experience tells us that the first aim is achieved for the vast majority of students, and this is enough for getting a good grade at the end of the course. About the more ambitious aim: it shall be for each student in this course to judge how much the study of this topic was useful for the student's professional life.

This is a Master course and it presumes some knowledge of Probability and Statistics from BA courses. Being a course for a Master in Finance at Bocconi University, the kind of previous knowledge which is taken for granted is based on the related BA programs at Bocconi. These programs are similar to the programs for BAs in Economics and Finance in most universities around the world. In any case, for students coming from other universities, the syllabus and the preliminary course program should give a clear idea of which notions shall be taken for granted in this course. The teachers of this course are fully available to help any student with suggestions for readings which could be useful to complete preliminary knowledge. Experience tells us that, not infrequently, students do not remember at once everything they did indeed study (and maybe got a good grade for) during their BA. However, memory of past knowledge easily comes back when necessary.
For an even better understanding of what is given for granted (Statistics and Probability wise), a summary of the relevant definitions and theory is added as an appendix to these handouts (appendix 19). This is not intended as a standalone text in basic Statistics and Probability: all students should read it in order to check their level of knowledge in basic Statistics and ask the teachers for help in case of problems. Basic notions of matrix algebra are also required; these are summarized in sections 6 and 7 and in appendix 18.

The main theoretical tools introduced in this course which may be new for most students are:

• some non parametric Statistics useful for value at risk computations
• an introductory but rather complete treatment of the multivariate linear model
• principal components analysis

Using these tools, and more basic notions of Probability and Statistics, the course describes applications in the fields of:

• value at risk
• factorial risk models for asset allocation
• style analysis
• efficient portfolio analysis
• performance evaluation
• Black and Litterman asset allocation

Most of the examples and applications described in the course shall make use of stock market data. This choice is not due to an assumed prominence of this market (it constitutes a rather small fraction of the overall financial market) but to the fact that dealing with more relevant markets (interest rates, exchange rates, commodities etc.) requires institutional and technical knowledge that cannot be assumed to be at the disposal of the course students. In any case, with the proper changes, most of what is discussed in this course can be and is applied in almost all kinds of financial markets.

A note on this version of the handouts

This is a further revised version of the course handouts. It is work in progress and, hopefully, it shall remain so. Old errors and typos are corrected but new ones may still creep in.
I'll be grateful for any suggestion, correction or comment, and I am grateful for the many past suggestions and corrections by students and colleagues. These handouts receive an update each year in September at the beginning of the course. In the September 2019 update the only changes of some import are: a rewriting of the example in section 9.12.11 and of the first two subsections of chapter 11, together with some relabeling and renumbering of chapter 11 subsections. The September 2020 update only corrects some errors and reformats some formulas.

This text was and is used for several different courses. Hence, a number of sections and subsections of these handouts are not required for the 20191 course exam. This is detailed in the course syllabus and specified in the text when necessary. In particular, chapter 20, which was part of the course in academic year 2014/2015, and chapter 21 are no longer in the program of the 20191 course.

Probability, Statistics and Finance

Modern Finance studies are broadly divided into two (interconnected) fields: asset pricing and Corporate Finance. Asset pricing studies the observed statistics on financial securities prices ("price" is here broadly intended as any kind of data concerning the value of an asset: it could be a price or a rate, a yield or an exchange rate etc.). The aim of the study is twofold. From the empirical point of view, we wish to characterize and summarize the joint distribution of observed prices and its evolution. This is of paramount practical importance in the fields of asset management and risk management. Due to their quality, detail and observational frequency, financial data require (and allow) more advanced statistical techniques than those needed for dealing with standard economic data. On the other hand, for the same reason, the practical usefulness of the results obtained from these data is in most cases greater by several orders of magnitude, compared with what we can do with standard economics problems.
An empirical trace of this is the fact that Finance is the only applied field of Economics where rules, laws, conduct codes etc. are mostly written in terms of mathematical and statistical models. From the theoretical point of view, the aim is to connect the observed financial data with the overall evolution of the economic system. In this, Finance has most in common with Economics. Finance, however, can exploit, in many cases, stronger connections between observables (that is: prices) than standard Economics. Most financial assets of interest are traded in fairly liquid and ordered markets. For this reason traders must, as a rule, at least approximately avoid/exploit simple arbitrage trades. The avoidance of arbitrage, if implemented by a relevant part of market agents (and this may not be the case), imposes strong functional constraints on the price system. Sometimes it is even possible to find exact or almost exact functional relationships among security prices ("arbitrage pricing") which are largely independent of a detailed modelling of agents' behaviour. Only when stronger results are required, for instance results connecting financial prices with the general behaviour of an economy and, most difficult, when we look for the financial implications of economic policy acts, must the study of Finance resort to the usual Economics practice of deriving price systems from hypotheses on agents' behaviour. This is a very difficult effort. A fortunate fact is that, for most practical market applications, this can, at least at a first level of approximation, be avoided.

Corporate Finance, in its origins, mostly deals with the capital structure of a firm as connected with its investments. The original problem was that of characterizing those investments which could be valuable for the firm and how to finance these.[1] Over time, further problems were considered by the field, mainly problems concerned with organizational and managerial issues.
How to choose and compensate management, how to implement strategies which are viable for the stakeholders, when and why to reorganize the firm, and so on. Due to these developments, modern Corporate Finance has much in common with organization theory and, in particular, with the field of industrial organization. Most research in the field of Corporate Finance has a strong empirical twist, being concerned with the consequences of financing, organizational and investment decisions on the "value" of the firm. As a consequence, huge databases containing corporate data have been built in the last 20 years and the statistical applications to corporate studies have grown to an amazing size. The Reader should notice that by "applications" here we do not mean only "academic applications": any Corporate Finance act (issue of new stock, bond issues, IPOs, M&A operations, investment planning etc.) is strongly based on a quantitative analysis of the relevant data.

For these reasons there is not much need to justify the presence of (several) courses in applied Probability and Statistics within a graduate curriculum in Finance. A very direct proof could be a visit to any trading floor, participation in meetings for an M&A operation, spending some time in the risk management office of a bank, or simply a reading of laws and regulations concerning the management of financial companies. The recent emphasis on "big data", when not debased to a fad, further confirms the central role of data analysis and data modelling in the general financial field. Another possible way to understand the role of quantitative methods in Finance (and in financial regulation) could be browsing through the programs of the institutional exams required in order to deal with clients in international markets. Among these, just consider the FINRA Registered Representative levels in the USA.
But the simplest and most direct way, at least in our opinion, is to point out the fact that most of Finance has to do with deciding today prices for economic entities whose precise future value is unknown.[1] In such a field, the availability of a language for speaking about uncertain events is, obviously, a necessity. Up to the present time, the most successful language devised for such a purpose is Probability Theory (competitors exist but lag far behind both in popularity and practical effectiveness). It is interesting to notice that the language of Probability Theory, intimately tied to the statement of prices for betting on uncertain events, is more directly appropriate for dealing with uncertainty in the field of Finance, where bets are actually made, than, say, in Physics, where the problem is not (at least at a prima facie level) that of betting on uncertain results but, maybe, that of describing the long term frequencies (a very non-empirical concept) of experimental results.[2]

[Footnote 1: It could be interesting to notice that, until recently, what exactly is a "firm", why it should exist and in which respects a "firm" in economic theory is similar to what is called a "firm" in common language, has not been really clear. See on this point the introduction of Jean Tirole, (1988), "The theory of industrial organization", MIT Press.]

Probability and Statistics

Financial contracts bearing strong similarities to modern contracts were traded well before the Hammurabi code was carved (in fact an article of the code deals with a particular option contract). Similar contracts were the basis for shipping ventures in classical Athens and later Rome. Such contracts were quite common and were priced without problems by ancient financiers. Well, not exactly without problems, as many examples of such contracts came to us through the writings of famous lawyers/orators like Demosthenes.
The use of Probability in Finance, however, is much more recent and can be seen as a direct offspring of the classical origin of Probability in the context of gambling. However, gambling problems are usually much easier to deal with than, say, security pricing problems. The reason for this is that in most "games of chance" two elements are usually agreed upon. First, the nature of the game is such that the probabilities of its results are agreed upon by the vast majority of participants, usually via symmetry arguments (equal probability of each side of a coin, or of drawing any given card, etc.). Second, in typical situations the betting decisions of players do not change the probability of results (we are speaking about games of chance like roulette, trente-et-quarante, rouge et noir, and dice games. In most card games the element of chance is in the card dealing and is then mediated by a strategic element in the card play phase; this makes analysis much more complex). The consequence of the first point is that, in typical games of chance, Statistics, as a tool for choosing probabilities, is not required (while it could be required in other betting settings like, e.g., horse racing, football matches etc.).

[Footnote 2: See the appendix on pag. 202 for a summary of a definition of probability based on betting systems and its connection with frequencies.]

Probability theory has nothing to say about the "right" probabilities to assign to possible events. Its field is the consistency (no arbitrage) among probability statements, whose numerical values do not originate in Probability Theory itself (except for obvious cases like the probability of the sure or impossible event) and, to a lesser degree, the interpretation of such statements.
As mentioned, the basic inputs required by Probability theory, namely the probabilities of simple events, are agreed upon in most games of chance, where almost everybody agrees on the validity of the simple symmetry arguments from which numerical values of probabilities are derived.[3] Maybe these symmetry arguments are justified by some putative set of past observations; however, the agreement is so widespread that it could be possible, even if improper, to tag as "wrong" probability assessments which disagree with the majority's. In this sense an inference tool for deriving probability estimates from, say, past data is not directly required for gambling.[4] In the words we are used to when considering financial risk management, we could say that in games of chance there is no model or estimation risk: numerical values of probabilities for possible events can well be considered as given and "correct", at least in the sense that almost everybody agrees on these.

This is not true in the Finance milieu. In the case of financial prices, simple symmetry arguments are, as a rule, not applicable and, for this reason, we are often in need of estimating such values using past observations and models. In other words, we need Statistics, with the implied estimation risk. Moreover, the statistical models we use are not "common knowledge". They are not derived from simple, agreed upon symmetry arguments like those used in most games of chance. This "uncertainty" about models is called "model risk".

Let us pass to the second point: the independence of probabilities from the strategic behaviour of players. The fact that, say, the future price of an asset is directly dependent on the bets made on it by traders (not necessarily in any ordered or "rational" way) is a mainstay of Finance (and Economics).
It embodies the complex interaction between judgment of values and expectations of prices which, through the concept of equilibrium, is both the stuff of economic theory and of day by day work in the markets. This interaction by itself contributes to determining the probabilities of market events, probabilities which cannot at any time be considered as "given" like those in typical games of chance. This interaction is not only complex, but also subject to change, and usually does not satisfy symmetry arguments, so that it cannot, with the exception of very simple contexts, be ignored (as we ignore, e.g. in the rolling of a die, the complex but stable and symmetry justifiable physical model of its chaotic rebounds on a hard surface) and solved with a simple symmetry induced probability model (for the die: each side has 1/6 probability of turning up). Consider the hypothetical case of a die where each face has a weight depending, in an unknown way, on the amount of money gamblers bet on it, where gamblers tend to choose their bets on the basis of numerological considerations depending on their humour and, maybe, on observed past results in the rolling of the same die. Arguably, different analysts shall choose different models for this, and the different choices shall have consequences on the behaviour of market agents. This complexity is another source of "model risk" typical of the study of Finance. Models, moreover, shall contain unknown quantities: "parameters", and we shall need to estimate these.

[Footnote 3: A great probabilist, Laplace, believed that this way of computing probability by symmetry was, or should be, possible in any sensible application of the concept of probability, excluding in this way, for instance, the application of Probability to horse racing betting.]

[Footnote 4: It must be said that some Statistics is actually used for periodical checking of the "randomness" of chance generating engines used in gambling (like roulette, the wheel of fortune or dice).]
Any estimate, based on rules of thumb or on the best Statistics, shall imply a possible error; this is called "estimation risk", and the main difference between rules of thumb and good Statistics is the ability of the latter to quantify such risk. In the end, this is what makes the field of Finance so different from (and much more difficult than) the study of games of chance. We do not only run risks implied by the fact that the results of bets are uncertain; we are also uncertain about how to model such risk (model risk) and we need data in order to estimate the parameters in the models we decide to use (estimation risk).

In principle it is possible to separate the two aspects that make financial markets and gambling casinos different. It is possible, and in fact it has been done in the past (for instance by Yahoo Finance), to create fictional markets where "stocks" with absolutely no economic meaning are "traded" between agents and the future prices of these stocks are determined (typically through an auction system) by the amounts "invested" in them by players. It is interesting to notice that in such contexts, where the "true value" of each share is in fact known to be 0 and the aim of the game is only that of moving ahead of the flock, prices follow paths which are qualitatively very similar to those observed in real financial markets. This should be instructive for understanding how, even when the traded securities have a real (if numerically not known) economic meaning and value, the simple interaction of agents in the market can create an evolution of prices partially independent of such value.[5] Economists hope that such "partial independence" is not too strong. In fact, financial markets have the relevant role of allocating investments among different economic endeavors in some "efficient" way, where "efficient" should mean, roughly, that investments with better prospects should receive, at least on average, more money.
The question is whether markets are a setting in which this happens or whether the market induced "noise" can overwhelm any value "signal". The history of market crises contains a rich set of clues about an answer to this question and we can say that, at least in some cases, the answer may be in the negative (but we are at a loss if we are asked for some system different from the market and able to be "right" at least as frequently).

[Footnote 5: Notice that even in casino games, where the "value" of each game (the probability of each possible result times the payoff of each result) is known, frequencies of bets fluctuate together with players' whims, and it may be possible that, on some gambling tables and for some numbers or colors, we observe a concentration of bets which is totally unjustified by any anomalous probability of the coveted result but that may, all the same, last even for a considerable time. (The Reader should think about the huge literature on "late numbers" in bingo, lotto and similar games.) As we wrote above, financial markets are casinos with the added twist that the probabilities of outcomes are not known and that the outcomes themselves depend on the opinions and hopes of players. The resulting mess is, then, fully understandable; we do not like this but, alas, a better (or, at least, not worse) tool for allocating investments among uncertain prospects is still to be discovered! To conclude this footnote on a positive tone, we must say that the study of modern financial markets has an advantage with respect to the study of modern economic systems. While irrational, arbitrage ridden behaviour is possible in both settings, at least in "normal" times modern financial markets tend to punish arbitrage-allowing investors in such a quick and (financially) harmful way that the propensity for a coherent assessment of one's own personal bets is a strong point (we repeat: in normal times) of most big investors. In more general economic situations, such punishment is not so quick; it can typically be made to burden the decision maker's offspring or other people, hence it does not bind decisions too much. In other words, the financial setting is a privileged setting in Economics at least because we can assume that, most of the time, agent behaviour may be stupid but not irrational.]

Ultimately, the decision of how much to bet on a given future scenario requires both an assessment of economic value AND an evaluation of the consequences of the interacting opinions of agents. This is a very difficult task which, as we said above, cannot rely on the simple symmetry arguments used to "agree on probabilities" in standard games of chance. Tools are required for economic evaluation, and tools are required for connecting past observations of market and, more in general, economic events to the statement of probabilities useful for choosing among actions whose results depend on future events. In the financial field this makes Probability intimately connected with Statistics and, more in general, with Economics.

A Caveat. From what we wrote here it could be deduced that the business of Probability and Statistics in Finance is that of forecasting future prices. If by this is meant forecasting the exact value of a future price, this would be a wrong deduction. If instead by the term "forecasting" we intend the assessment of probability distributions for future prices, we get a clearer picture of what we intend to do. Standard introductions to Finance theory stress this point by describing, in a very simplified way, an investment decision as a choice among risk/expected return pairs. More advanced analyses describe investment as a choice among probability distributions of future returns.

Antidotes to delusions

There is another relevant reason for the study of Probability and Statistics during financially oriented training. Financial markets are "full of intrinsic randomness" in the sense that, since we do not possess, and in all probability shall never possess, tools which allow us to forecast the future with precision, we must learn to live in an environment of unresolvable uncertainty. The human mind does not seem to adapt well to environments of this kind. Each time the future value of variables is relevant for us but we are not able to determine it, either by forecasting or by direct intervention, our brain, which craves stable patterns, shall, if uncontrolled, tend to create such patterns out of nothing and be fooled into believing in illusions (there exists an immense literature on gambling behavior and perception fallacies which substantiates this statement). This explains at least a subset of the observed irrational behaviours of investors. Statistics and Probability are also relevant because they can be seen as antidotes to such delusions.[6] They may not make us right most of the time, and it may well be that some lucky dumbo shall enjoy results better than ours. But, at least, they may help us not to be upset if something which a priori we considered unlikely does indeed happen, or prevent us from changing our decision rule due to events which in fact confirm the optimality of such a rule. Maybe this is not much, but in the not too long run it may count for much. With the help of such tools we may understand, for instance, how, given the amount of variance in the market and the huge number of dumbo investors, the fact that some investor of this kind shall be better off than us "technically learned investors" is so likely as to be almost sure and, for this reason, should not upset us or induce dumbo behaviours in us.

There exists a classic literature on the topics of gambling, luck and delusions. It goes back at the very least to Plutarch, but traces can be seen in the Bible itself.
It is not a peculiarity of this literature that its principles are time after time "rediscovered" by well meaning, while perhaps not sufficiently well read, Authors. What is peculiar to it is the fact that such rediscoveries always surprise readers as if they presented them with new ideas. This tells us much about our inability to learn how to deal with uncertainty. No matter how many times delusions of luck are explained, they are bound to come back again. I suggest interested students read a pair of quite engaging classics in the field: "Chance and Luck" by the astronomer Richard Proctor (1887) and "Extraordinary Popular Delusions and the Madness of Crowds" by Charles Mackay (1841), both available for free on www.gutenberg.org.

[Footnote 6: Only partial antidotes: at some time in the future everybody shall fall to the lure of "randomness deciphering". Anecdotes where the best statisticians "fell for it" pepper introductory and expository books on Statistics and Probability.]

Quantitative methods as legal disclaimer

In recent years, first in the US and then all over the world, a new role of quantitative models has surfaced and is becoming pervasive: that of legal disclaimer. Suppose you are an asset manager and your client is not satisfied with your results. The client is going to question you about what you did in order to get such results. Maybe such questioning could take the form of a legal procedure where you are sued for malpractice. This is common to most professions; just consider the medical profession as a classic instance. Answering such actions requires some definition of "right" practice, which, in most professional fields, is very difficult to state (in principle, if such a definition were really possible, it would take the form of a computer program which could take the place of the involved professionals).
In most fields, think again of the medical field, this has taken the shape of "protocols of intervention" which should define the lines of action a professional should follow when dealing with a case. This was undoubtedly a useful development in some fields, but it was also accompanied by a huge increase in the bureaucratic aspect of any profession and, what is potentially more dangerous, it implied an increase in the rigidity of behaviour when dealing with situations where following the protocol is potentially dangerous for the client/patient. In these cases the "protocol following" behaviour is useful for the professional, who can defend himself on this basis, but could be fatal for the client. The classic solution to this problem, stated by Shakespeare in one of his most quoted verses (Henry VI, Part 2, Act IV, Scene 2: "The first thing we do, let's kill all the lawyers"), may seem a little drastic.

In Finance, in particular in the asset management field, the establishment of best practices and protocols takes, in most circumstances, the form of quantitative models of asset selection and evaluation which, in order to achieve easy legal validity, are mostly based on standard academic arguments. It is much easier to defend yourself in a court of justice by saying: "the asset allocation model we follow is *** and is based on published research by ***. The model gave a small probability of negative results which, alas, did actually happen", than by saying: "I chose the asset allocation on the basis of my gut feeling and past experience. It could go well, it could go bad. Tough luck: it did go bad". While there may not be much difference between the asset allocations, the first defense is certainly stronger than the second. This attitude is so widespread that, in some fields of Finance, it has become law: just think about the risk management rules derived from the series of Basel agreements.
Just like in the field of medicine, such developments had some positive consequences (beyond the disclaimer effect) in that they require discipline and clarity of analysis on the part of agents. They also contributed to spreading a "formal" use of quantitative models which, when not continuously questioned and updated, could increase the risk of decisions taken "because the model says so" just in view of the legal disclaimer a model may offer. Whatever your evaluation of such a development may be, this is the world of finance you are going to enter. If you want to be clever agents in this world you need to understand it, and understanding passes (also!) through the understanding of the quantitative tools applied for quantifying and for taking risks.

Required Probability and Statistics concepts. Sections 1-5.

In the first 5 sections of these handouts only very simple concepts from Statistics and Probability are required. The most relevant are as follows: expected value, variance, standard deviation, correlation, statistical independence. Moments, quantiles. Binomial distribution. Gaussian distribution. Sampling variability. Tchebychev inequality. Confidence intervals. Hypothesis testing. These should already be known to the Readers from their BA courses; a very short summary is available in section 19 of these notes and is a required part of this course. For a quick check of the basic points see: 19.21, 19.22, 19.23, 19.41, 19.42, 19.48, 19.28, 19.29, 19.32, 19.34, 19.35, from 19.76 to 19.81, 19.114, 19.115, 19.116, from 19.118 to 19.136, 19.24. Beyond definitions and basic properties, the most important point to have clear in mind is the differences and the connections between Probability and Statistics concepts. For a quick check go to section 19.

1 Returns

1.1 Return definitions

There is a love story with returns in Finance: while prices are the financially relevant quantities (what we pay and what we get), we often speak and write models using returns.
It is true that, for one period models, there is substantially no difference in considering a change in price and a return (a difference vs a percentage difference) as the initial price is assumed known. However, returns, while useful, can be tricky in multi period models or when using time series data for estimating parameters in single period models. So, they must be well understood. Returns come in two versions. Let $P_{it}$ be the price of the $i$-th stock at time $t$. The linear or simple return (in the future we shall deal with dividends and with total returns) between times $t_{j-1}$ and $t_j$ is defined as:

$$r_{it_j} = \frac{P_{it_j}}{P_{it_{j-1}}} - 1$$

The log return is defined as:

$$r^*_{it_j} = \ln\left(\frac{P_{it_j}}{P_{it_{j-1}}}\right)$$

In both these definitions of return we do not consider possible dividends. There exist corresponding definitions of total return where, in case a dividend $D_j$ is accrued between times $t_{j-1}$ and $t_j$, the numerator of both ratios becomes $P_{t_j} + D_j$. Moreover, here we do not apply any accrual convention to our returns, that is: we just consider period returns and do not transform these on a, say, yearly basis. It is to be noticed that, while $P_{t_j}$ means "price at time $t_j$", $r_{t_j}$ is a shorthand for "return between times $t_{j-1}$ and $t_j$", so that the notation is not really complete and its interpretation depends on the context. When needed, for clarity's sake, we shall specify returns as indexed by the beginning and the end point of the time interval in which they are computed as, for instance, in $r_{t_{j-1};t_j}$. The two definitions of return yield different numbers, for the same prices, when the ratio between consecutive prices is far from 1. Consider the Taylor formula for $\ln(x)$ for $x$ near 1:

$$\ln(x) = \ln(1) + (x-1) - \frac{(x-1)^2}{2} + \dots$$

If we truncate the series at the first order term we have:

$$\ln(x) \cong 0 + x - 1$$

so that, if $x$ is the ratio between consecutive prices, for $x$ near one the two definitions give similar values. It is also clear that $\ln(x) \leq x - 1$.
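As a quick numerical illustration of the two definitions and of the inequality $\ln(x) \leq x-1$ (the prices used here are invented for the example):

```python
import math

# Hypothetical consecutive prices (illustrative numbers only)
p_prev, p_now = 100.0, 105.0

r_lin = p_now / p_prev - 1            # linear (simple) return
r_log = math.log(p_now / p_prev)      # log return

# ln(x) <= x - 1 always, with equality only at x = 1
assert r_log <= r_lin

# Near x = P_t / P_{t-1} = 1 the two almost coincide; the gap grows
# roughly like (x - 1)^2 / 2 as the price ratio moves away from 1
gap = r_lin - r_log
approx_gap = (r_lin ** 2) / 2
```

For a 5% price move the two returns differ by roughly 0.12 percentage points; for large moves the difference becomes substantial.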
In fact $x-1$ is equal and tangent to $\ln(x)$ at $x = 1$ and above it everywhere else (the second derivative of $\ln(x)$ is negative, while it should change sign for $\ln(x)$ to cross $x-1$ before or after the tangency point). This implies that, if one kind of return is used and mistaken for the other, the approximation errors shall all be of the same sign. We also see that the size of the error increases roughly as $(x-1)^2$. In Finance the ratio of consecutive prices (sometimes called "total return" and maybe corrected by taking into account accruals) is often modeled as a random variable with an expected value very near 1. This implies that the two definitions shall give different values with sizable probability only when the variance (or, more in general, the dispersion) of the price ratio distribution is non negligible, so that observations far from the expected value have non negligible probability. Since standard models in Finance assume that the variance of returns increases when the time between the prices for which the return is computed increases, this implies that the two definitions shall more likely imply different values when applied to long term returns. Why two definitions? The underlying prices are the same and this implies that both definitions, if not swapped by error, give us the same information. The point is that each definition is useful, in the sense of making computations simple, in different cases. From now on, for simplicity, let us only consider times $t$ and $t-1$. Let the value of a buy and hold portfolio, composed of $k$ stocks each in quantity $n_i$, at time $t$ be:

$$\sum_{i=1..k} n_i P_{it}$$

It is easy to see that the linear return of the portfolio shall be a linear function of the returns of each stock.
$$r_t = \frac{\sum_{i=1..k} n_i P_{it}}{\sum_{j=1..k} n_j P_{jt-1}} - 1 = \sum_{i=1..k} \frac{n_i P_{it-1}}{\sum_{j=1..k} n_j P_{jt-1}} \frac{P_{it}}{P_{it-1}} - 1 =$$

$$= \sum_{i=1..k} w_{it}(r_{it} + 1) - 1 = \left(\sum_{i=1..k} w_{it} r_{it} + \sum_{i=1..k} w_{it}\right) - 1 = \sum_{i=1..k} w_{it} r_{it} + 1 - 1 = \sum_{i=1..k} w_{it} r_{it}$$

where $w_{it} = \frac{n_i P_{it-1}}{\sum_{j=1..k} n_j P_{jt-1}}$ are non negative "weights" summing to 1 which represent the percentage of the portfolio invested in the $i$-th stock at time $t-1$. This simple result is very useful. Suppose, for instance, that you know at time $t-1$ the expected values of the returns between times $t-1$ and $t$. Since the expected value is a linear operator (the expected value of a sum is the sum of the expected values; moreover, additive and multiplicative constants can be taken out of the expected value) and the weights $w_{it}$ are known, hence non stochastic, at time $t-1$, we can easily compute the expected return of the portfolio as:

$$E(r_t) = \sum_{i=1..k} w_{it} E(r_{it})$$

Moreover, if we know all the covariances between $r_{it}$ and $r_{jt}$ (if $i = j$ we simply have a variance) we can find the variance of the portfolio return as:

$$V(r_t) = \sum_{i=1..k} \sum_{j=1..k} w_{it} w_{jt} Cov(r_{it}; r_{jt})$$

For log returns this is not so easy. In fact we have:

$$r^*_t = \ln\left(\frac{\sum_{i=1..k} n_i P_{it}}{\sum_{j=1..k} n_j P_{jt-1}}\right) = \ln\left(\sum_{i=1..k} \frac{n_i P_{it-1}}{\sum_{j=1..k} n_j P_{jt-1}} \frac{P_{it}}{P_{it-1}}\right) = \ln\left(\sum_{i=1..k} w_{it} \exp(r^*_{it})\right)$$

The log return of the portfolio is not a linear function of the log (and also of the linear) returns of the components. In this case assumptions on the expected values and covariances of the returns of each security in the portfolio cannot be (easily) translated into assumptions on the expected value and the variance of the portfolio return by simple use of the basic "expected value of the sum" and "variance of the sum" formulas. Think how difficult this would make performing any standard portfolio optimization procedure as, for instance, the Markowitz mean/variance model. Before going on with log returns we stress again an important point.
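Before that, a numerical check of the linearity result above (holdings and prices are invented for the example):

```python
import numpy as np

# Hypothetical buy-and-hold portfolio: shares held and prices at t-1 and t
n = np.array([10.0, 5.0, 20.0])       # quantities, fixed between t-1 and t
p0 = np.array([100.0, 50.0, 20.0])    # prices at t-1 (known, non stochastic)
p1 = np.array([103.0, 49.0, 21.0])    # prices at t

# Portfolio linear return computed directly from portfolio values...
r_port = n @ p1 / (n @ p0) - 1

# ...equals the weighted sum of the individual linear returns
w = n * p0 / (n @ p0)                 # value weights at t-1, summing to 1
r_lin = p1 / p0 - 1
assert np.isclose(r_port, w @ r_lin)

# The same identity fails for log returns: ln(sum w*exp(r*)) != sum w*r*
r_log_port = np.log(n @ p1 / (n @ p0))
assert not np.isclose(r_log_port, w @ np.log(p1 / p0))
```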
All the computations given above suppose that prices at time $t-1$ are known, that is: non stochastic; moreover, we suppose the investment is not changed between $t-1$ and $t$. If this were not so, we could not make passages such as:

$$E(r_t) = \sum_{i=1..k} w_{it} E(r_{it})$$

We should be satisfied by the almost useless

$$E(r_t) = \sum_{i=1..k} E(w_{it} r_{it})$$

This because $w_{it}$ is a function of prices at time $t-1$, which would now be stochastic, and, even if the prices at time $t-1$ were known, the change of investment between times $t-1$ and $t$ would make the weights at $t$, as seen from $t-1$, stochastic. The same problem arises for the computation of the variance. A stochastic $P_{t-1}$, and/or a change of strategy, then, destroys the possibility of recovering the expected value and the variance of the portfolio linear return from the expected values and the variances and covariances of the linear returns of the individual securities. Now, log returns. These are much easier to use than linear returns when we aim at describing the evolution of the price of a single security through time. Suppose we observe the prices $P_{t_i}$ at times $t_1, \dots, t_n$; the log return between $t_1$ and $t_n$ shall be:

$$r^*_{t_1,t_n} = \ln\frac{P_{t_n}}{P_{t_1}} = \ln\left(\frac{P_{t_n}}{P_{t_{n-1}}}\frac{P_{t_{n-1}}}{P_{t_{n-2}}}\cdots\frac{P_{t_2}}{P_{t_1}}\right) = \ln\prod_{i=2...n}\frac{P_{t_i}}{P_{t_{i-1}}} = \sum_{i=2...n} r^*_{t_i}$$

It is then easy, for instance, given the expected values and the covariances of the sub period returns, to compute the expected value and the variance of the full period return (from $t_1$ to $t_n$). On the other hand, this is not true for linear returns. We have:

$$r_{t_1,t_n} = \frac{P_{t_n}}{P_{t_1}} - 1 = \frac{P_{t_n}}{P_{t_{n-1}}}\frac{P_{t_{n-1}}}{P_{t_{n-2}}}\cdots\frac{P_{t_2}}{P_{t_1}} - 1 = \prod_{i=2...n}\frac{P_{t_i}}{P_{t_{i-1}}} - 1 = \prod_{i=2...n}(r_{t_i} + 1) - 1$$

In general the expected value of a product is difficult to evaluate and does not depend only on the expected values of the terms. A noticeable special case is that of non correlation among the terms. For the computation of the variance, the problem is even worse.
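The telescoping identities above can be checked numerically; the price path is invented for the example:

```python
import numpy as np

# Hypothetical price path for a single security (illustrative numbers)
prices = np.array([100.0, 104.0, 101.0, 108.0, 107.0])

log_rets = np.diff(np.log(prices))        # per period log returns
lin_rets = prices[1:] / prices[:-1] - 1   # per period linear returns

# Over time, log returns simply add up to the full period log return...
assert np.isclose(log_rets.sum(), np.log(prices[-1] / prices[0]))

# ...while linear returns compound multiplicatively
assert np.isclose(np.prod(1 + lin_rets) - 1, prices[-1] / prices[0] - 1)

# Mistaking one for the other gives per period errors that are all of the
# same sign (x - 1 >= ln x), so they accumulate instead of cancelling
assert np.all(lin_rets >= log_rets)
```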
It is clear that, when problems involving the modeling of portfolio evolution over time are considered, we shall often see promiscuous and inexact use of the two definitions. You should keep in mind that standard "introductory" portfolio allocation models are one period models; when we consider multi-period portfolio models, returns may be more difficult to use than simple prices. To sum up: the two definitions of return yield different values when the ratio between consecutive prices is not equal to 1. The linear definition works very well for portfolios over a single period and conditional on the knowledge of prices at time $t-1$: expected values and variances of portfolios can be derived from expected values, variances and covariances of the components, as the portfolio linear return over a time period is a linear combination of the returns of the portfolio components. For analogous reasons the log definition works very well for single securities over time. We conclude this section with three warnings. These should be obvious but experience teaches the opposite. First. Many other definitions of return exist and each one originates either from traditional accounting behavior (and typically is connected with some specific asset class) or from specific computational needs. These are usually based on linear returns but use different conventions for computing the number of days between two prices and the accrual of possible dividends and coupons. Second. No single definition is the "correct" or the "wrong" one. In fact such a statement has no meaning. The correctness in the use of a definition depends on the context in which it is applied (accounting uses are to be satisfied) and, obviously, on avoiding naive errors the like of exponentiating linear returns for deriving prices or summing log returns over different securities in order to get portfolio returns.
For instance, the fact that, for a price ratio near 1, the two definitions give similar values should not induce the reader into the following consideration: "if I break a sizable period of time into many short sub periods, such that prices at consecutive times are likely to be very similar, I am going to make a very small error if I use, say, the linear return in the accrual formula for the log return". This is wrong: in any single sub period the error is going to be small, but, as mentioned above, this error has always the same sign, so that it shall sum up and not cancel, and over the full time interval the total error shall be the same no matter how many sub periods we consider. Third: this multiplicity of definitions requires that, when we speak about any property of "returns", it be made clear which return definition we have in mind. For instance: the expected value of log returns must not be confused with the expected value of linear returns. The probability distribution of log returns shall not be the same as the probability distribution of linear returns, and so on. Practitioners are very precise in specifying such definitions in financial contracts; the common imprecision in financial newspapers can be justified in view of their descriptive purposes. The same precision is not always found in academic papers.

[Figure: r and r* as functions of Pt/Pt-1]

1.2 Price and return data

Finance is "full of numbers": price data and related statistics are gathered for commercial and institutional reasons and are readily available in free and commercial databases. This has been true for many years and, for some relevant markets, databases have been reconstructed back to the nineteenth century and in some cases even before.
As in any field where data are so overwhelmingly available and not directly created by the researcher through experiments, any researcher must be cautious before using them and follow at least some very simple rules which could be summarized in the sentence: "KNOW YOUR DATA BEFORE USING IT!". What does the number mean? How was it recorded? Did it always mean the same thing? These are three very simple questions which should get an answer before any analysis is attempted. Failure to do so could taint results in such a way as to make them irrelevant or even ridiculous. Avoid any oversimplifying position the like of the surprising one (if you consider the usual quality of his thought) by Schumpeter quoted in section 14. This is not the place for a detailed discussion but it could be useful for us to try and analyze a very simple example. Suppose you wish to answer the following question: "how did the US stock market behave during its history?". You browse the Internet and run a search for literature on the topic. Suppose you are able to shunt off conspiracy theorists, finance fanatics, quack doctors and snake oil sellers, Ponzi scheme advertising and the like. Let us say that you concentrate on academic and academia linked literature (which by no means assures you of avoiding the peculiar positions just listed). At the onset you could be puzzled by the fact that, in the overwhelming majority of papers and books, the performance of markets where thousands of securities, not always the same, are traded, and traded in different historical moments and under different institutional rules, is summarized by a single number, an index. For the moment we do not consider this point. You find a whole jungle of academic and non academic references among which you choose two frequently quoted expository books by famous academicians: "Irrational Exuberance" by Robert J. Shiller (of Yale) and "Stocks for the Long Run" by Jeremy J. Siegel (of Wharton)7.
You browse through the first chapter of both and find Figure 1-1 of Siegel, which tells you that 1 dollar invested in stocks in 1802 would have become 7,500,000 dollars by 1997. Moreover you read that 1 dollar of 1802 is equivalent (according to Siegel) to 12 dollars of 1997. The growth should have been of about 625,000 times in real terms (62,500,000%!). On the other hand, Figure 1.1 of Shiller's book gives the following information: between 1871 and 2000 the S&P composite index corrected by inflation grew from (roughly) 70 to (roughly) 1400, a real growth of roughly 20 times (2000%). Both numbers are big, but also quite different. (7 The connection between the two authors and the two books is clearly stated by Shiller in his Acknowledgments.) Now you are puzzled. Sure: a part of the difference is due to the different time basis. Looking at Siegel's picture you see that the dollar value of the investment around 1870 was about 200; even exaggerating inflation, attributing the full 12 times devaluation to the 1870-2000 period, and assessing this 200 to be worth 2400 1997-dollars, we would have a real increase of 3125 times, which is still more than 150 times Shiller's number. This, obviously, cannot come from the difference in the terminal years of the samples, as the period 1997-2000 was a bull market period and should reduce, not increase, the difference. Now, both authors are famous Finance professors and at least one of them (Shiller) is one of the gurus of the present crisis. So the problem must be in the reader (us). Let us try and improve our understanding by reading the details. First we notice that Siegel quotes as source for the raw data the Cowles series as reprinted in Shiller's book "Market Volatility" for the 1871-1926 period and the CRSP data for the following period, while Shiller speaks about the S&P composite index.
Reading with care we see another difference: Shiller speaks about a "price" index while Siegel about a reinvested-dividends total return index. Is this the trick? Browsing the Internet we see that Shiller's data are actually available for downloading (http://www.econ.yale.edu/∼shiller/data.htm). We can compute the total return for Shiller's data between 1871 and 1997, and the real increase now is from 1 dollar to 3654 dollars in real terms. We also see that the CPI passed from 12 to 154 in the same time interval, so that the "12 times" rule for the value of the dollar used by Siegel seems a good approximation8. There is still some disagreement between the numbers (Siegel 3125, but with exaggerated inflation, and Shiller 3654), but we think that, at least for answering our question, we have enough understanding. In this very short and summary analysis we did learn some important things. First: understand your question. "How did the US market behave during its history" is, we now understand, not quite a well specified question. Are we looking for a summary of the history of prices, or for the history of one dollar invested in the market? The two different questions have two different answers and require different data. Second: understand your data. Price data? Total return data? Raw or inflation corrected? (8 Beware of long term inflation indexes. The underlying hypothesis is that the basket of consumption goods be comparable through time. As an anecdotal hint of the contrary: a very good riding horse in 1865 could cost 200 dollars; a "comparable status" car costs, today, 50,000 dollars. If we, quite heroically, compare the two "goods" on the basis of their use, we see an increase in price not of 12 times but of 250 times. If we use the "12 times" rule, we get 2400 dollars, which might be the price of a scooter. Which is the right comparison?)
There are many subtle but relevant points that should be made; we only mention the survivorship bias problem, which taints the ex post use of financial series. But we stop here for the moment and do not mention the fact that a lot of discussion has run about the relevance of the questions and of the answers and their interpretation. The fact is: Siegel and Shiller start with similar data but they reach quite different conclusions (at least, this is their opinion on their work). We can reconcile the data: we understand they are using the same data in two different ways. However, why does each of them draw a different conclusion and, moreover, why do they "agree to disagree"?

1.3 Some empirical "facts"

While we realize, if not fully understand, these differences of opinion, this could be the right place to state several empirical "facts" that underlie much of the discussion about the long run behaviour of the US stock market. We do this with the yearly Shiller dataset (widely used in the academic literature). We shall concentrate on the total log return series. The dataset starts in 1871 and is updated each year; since in the latest available version Shiller uses the dataset up to 2013 included, we shall limit our computations to the interval 1871-2013. During this interval the average real log total return of the index was 6.33%. In the same period the average real one year interest rate was 1.03%, so that the so called risk premium was about 5.3%. The standard deviation of the real log total return was 17.09%, while the same statistic for one year real interest rates was 6.54%. The 5.3% average real log total return in excess of the yearly rate (which was even higher up to year 2000), compared with the 17.09% standard deviation (even smaller than this up to 2000), did generate a literature concerned with the "equity premium puzzle". The average real dividend yield (up to 2011 only) is 4.45% and its standard deviation is 1.5%.
The average real log price return was 2.16% and its standard deviation 17.68%. While we can only approximately sum these two results and compare them with the total real log return, we see that most of the equity premium is associated with the dividend yield. Notice that the correlation coefficient between real dividend yield and real log price return is .10 (positive but small); this explains why the standard deviation of the total real log return is even smaller than the sum of the standard deviations of the log real price return and the real dividend yield. On the other hand, this small correlation is, by itself, a puzzle. A last piece of simple data analysis: the 1 year autocorrelation of the real total log return series is very small: 2.29%. This is a first simple piece of evidence of the fact that it is very difficult to forecast future returns on the basis of past returns. Some of these empirical facts are at the basis of the simple stock price evolution model we shall introduce in the next chapter.

[Siegel's FIGURE 1-1: Total Nominal Return Indexes, 1802-1997]

Examples
Exercise 1a - returns.xls
Exercise 1b - returns.xls

2 Logarithmic random walk

The (naive) log random walk (LRW) hypothesis on the evolution of prices states that, if we abstract from dividends and accruals, prices evolve approximately according to the stochastic difference equation:

$$\ln P_t = \ln P_{t-\Delta} + \epsilon_t$$

where the "innovations" $\epsilon_t$ are assumed to be uncorrelated across time ($cov(\epsilon_t; \epsilon_{t'}) = 0 \ \forall t \neq t'$), with constant expected value $\mu\Delta$ and constant variance $\sigma^2\Delta$. Sometimes a further hypothesis is added and the $\epsilon_t$ are assumed to be jointly normally distributed. In this case the assumption of non correlation becomes equivalent to the assumption of independence. Since $\ln P_t - \ln P_{t-\Delta} = r^*_t$, the LRW is obviously equivalent to the assumption that log returns are uncorrelated random variables with constant expected value and variance.
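A minimal sketch of the LRW difference equation (with $\Delta = 1$; the parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# LRW: ln P_t = ln P_{t-1} + eps_t, iid Gaussian innovations (Delta = 1)
mu, sigma, n_steps = 0.05, 0.18, 250   # hypothetical yearly parameters
eps = rng.normal(mu, sigma, size=n_steps)

# Recursion unrolled: ln P_t = ln P_0 + sum of the innovations
log_p = np.log(100.0) + np.cumsum(eps)
prices = np.exp(log_p)

# Equivalently, step by step: P_t = P_{t-1} * exp(eps_t),
# so prices stay strictly positive by construction
p = 100.0
for e in eps:
    p *= np.exp(e)
assert np.isclose(p, prices[-1])
```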
A specific probability distribution for $\epsilon_t$ is not required at this introductory level. It is, however, the case that, often, the log random walk hypothesis is presented from scratch assuming $\epsilon_t$ to be Gaussian, or normal. Notice that from the model assumptions we have $P_t = P_{t-\Delta} e^{r^*_t} = P_{t-\Delta} e^{\epsilon_t}$, so, if $\epsilon_t$ is assumed Gaussian, $P_t$ shall be lognormally distributed. A linear (that is: without logs) random walk in prices was sometimes considered in the earliest times of quantitative financial research, but it does not seem a good model for prices since a sequence of negative innovations may result in negative prices. Moreover, while the hypothesis of constant variance for (log) returns may be a good first order approximation of what we observe in markets, the same hypothesis for prices is not empirically sound: in general price changes tend to have a variance which is an increasing function of the price level. A couple of points to stress. First: $\Delta$ is the "fraction of time" over which the return is defined. This may be expressed in any unit of time measurement: $\Delta = 1$ may mean one year, one month, one day, at the choice of the user. However, care must be taken so that $\mu$ and $\sigma^2$ are assigned consistently with the choice of the unit of measurement of $\Delta$. In fact $\mu$ and $\sigma^2$ represent the expected value and variance of the log return over an horizon of length $\Delta = 1$ and they shall be completely different if 1 means, say, one year (as it usually does) or one day (see below for a particular convention for translating the values of $\mu$ and $\sigma^2$ between different units of measurement of time, which is one of the consequences of the log random walk model). Second: suppose the model is valid for a time interval of $\Delta$ and consider what happens over a time span of, say, $2\Delta$. By simply composing the model twice we have:

$$\ln P_t = \ln P_{t-2\Delta} + \epsilon_t + \epsilon_{t-\Delta} = \ln P_{t-2\Delta} + u_t$$

having set $u_t = \epsilon_t + \epsilon_{t-\Delta}$.
The model appears similar to the single $\Delta$ one, and in fact it is, but it must be noticed that the $u_t$, while uncorrelated (due to the hypothesis on the $\epsilon_t$) on a time span of $2\Delta$, shall indeed be correlated on a time span of $\Delta$. This means, roughly, that the log random walk model can be aggregated over time if we "drop" the observations (just one in this case) in between each aggregated interval (in our example the model shall be valid if we drop every other original observation). This is going to be relevant in what follows. The LRW was a traditional standard model for the evolution of stock prices. It is obviously a wrong model, if understood as stating that prices are dictated by "chance". It can be considered a good descriptive model in the sense that its success depends not on its interpretation of the actual process of price creation (it would fail miserably) but on its consistency with observed "large scale" (i.e. $\Delta$ not too small) statistical properties of prices. Consistency is measured by comparing probabilities of events as given by the model with observed frequencies of such events. Even from this point of view, while the model is not dramatically wrong and is still useful for introductory and simple purposes, the weight of empirical analysis during the last thirty years has led most researchers to consider this hypothesis as a very approximate description of stock price behavior. While no consensus has been reached on an alternative standard model, there is a general agreement about the fact that some sort of (very weak) dependence exists of today's returns on the full, or at least recent, history of returns. Moreover, the constancy of the expected value and variance of the innovation term has been strongly put under question. In any case, the LRW still underlies many conventions regarding the presentation of market statistics.
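The aggregation remark above can be checked by simulation: the two step innovations $u_t$ are correlated when sampled every $\Delta$, but uncorrelated once every other observation is dropped (all numbers here are simulated, not market data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Aggregating the LRW over two steps: u_t = eps_t + eps_{t-1}
eps = rng.normal(0.0, 1.0, size=100_000)
u = eps[1:] + eps[:-1]     # overlapping two step innovations

# Sampled every step, consecutive u's share one eps, so they are correlated
# (the theoretical correlation is cov/var = 1/2)
rho_overlap = np.corrcoef(u[:-1], u[1:])[0, 1]

# Sampled every other step (dropping the observation in between),
# consecutive u's share no eps and are uncorrelated
rho_nonoverlap = np.corrcoef(u[:-2:2], u[2::2])[0, 1]
```

With this sample size `rho_overlap` comes out close to 0.5 and `rho_nonoverlap` close to 0, as the model predicts.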
Moreover, the LRW is perhaps the most important justification for the commonly held equivalence between the intuitive term "volatility" and the statistical entity "variance" (or better "standard deviation"). An important example of this concerns the "annualization" of expected value and variance. We are used to the fact that, often, the rate of return of an investment over a given time period is reported in an "annualized" way. The precise conversion from a period rate to a yearly rate depends on accrual conventions. For instance, for an investment of less than one year in length, the most frequent convention is to multiply the period rate by the ratio between the (properly measured according to the relevant accrual conventions) length of one year and the length of the investment. So, for instance, if we have an investment which lasts three months and yields a rate of 1% in these three months, the rate on a yearly basis shall be 4%. It is clear that this is just a convention: the rate for an investment of one year in length shall NOT, in general, be equal to 4%; this is just the annualized rate for our three months investment. It would be true, for instance, were the term structure of interest rates constant. However such a convention can be useful for comparisons across investment horizons. In a similar way, when we speak of the expected return or the standard deviation/variance of an investment it is common to report the number in an annualized way even if we speak of returns for periods of less or of more than one year. The actual annualization procedure is based on a convention which is very similar to the one used in the case of interest rates. As in that case, the convention is "true", that is: annualized values of expected value and variance correspond to per annum expected values and variances, only in particular cases. The specific particular case on which the convention used in practice is based is the LRW hypothesis.
If we assume the LRW and consider a sequence of $n$ log returns $r^*_t$ at times $t, t-1, t-2, ..., t-n+1$ (just for the sake of simplicity in notation we suppose each time interval $\Delta$ to be of length 1 and drop the generic $\Delta$) we have that:

$$E(r^*_{t-n,t}) = E\left(\sum_{i=0,...,n-1} r^*_{t-i}\right) = \sum_{i=0,...,n-1} E(r^*_{t-i}) = n\mu$$

$$Var(r^*_{t-n,t}) = Var\left(\sum_{i=0,...,n-1} r^*_{t-i}\right) = \sum_{i=0,...,n-1} Var(r^*_{t-i}) = n\sigma^2$$

This obvious result, which is a direct consequence of the assumption of constant expected value and variance and of non correlation of the innovations at different times, is typically applied, for annualization purposes, also when the LRW is not considered to be valid. So, for instance, given an evaluation of $\sigma^2$ on daily data, this evaluation is annualized by multiplying it, say, by 256 (or any number representing open market days; different conventions exist); it is put on a monthly basis by multiplying it by, say, 25 and on a weekly basis by multiplying it by, usually, 5. As we stressed before, this is not just a convention, but the correct procedure, if the LRW model holds. In this case, in fact, the variance over $n$ time periods is equal to $n$ times the variance over one time period. If the LRW model is not believed to hold, for instance if the expected value and/or the variance of returns is not constant over time or if we have correlation among the $\epsilon_t$, this procedure shall be applied just as a convention.9 (9 Empirical computation of variances over different time intervals typically results in sequences which tend to increase less than linearly with respect to the increase of the time interval between consecutive observations. This could be interpreted as the existence of (small) on average negative correlations between returns.)
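The annualization convention implied by this result (mean scales with $n$, variance with $n$, hence the standard deviation with $\sqrt{n}$) in a few lines; the daily parameter values are made up for illustration:

```python
import math

# Hypothetical daily log return parameters
mu_daily = 0.0002       # daily mean log return
sigma_daily = 0.012     # daily standard deviation
n_days = 256            # one common "open market days" convention

# Under the LRW these are exact, otherwise they are just a convention
mu_annual = n_days * mu_daily                    # mean scales with n
var_annual = n_days * sigma_daily ** 2           # variance scales with n
sigma_annual = math.sqrt(n_days) * sigma_daily   # sd scales with sqrt(n)

assert math.isclose(sigma_annual, math.sqrt(var_annual))
```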
The fact that, under the LRW, the expected value grows linearly with the length of the time period while the standard deviation (square root of the variance) grows with the square root of the number of observations, has created a lot of discussion about the existence of some time horizon beyond which it is always proper to hold a stock portfolio. This problem, conventionally called "time diversification", and more popularly "stocks for the long run", has been discussed at length both on the positive side (commonly sustained by fund managers) and on the negative side (more rooted in academia: Paul Samuelson is a non negligible opponent of the idea); we shall consider it in the next section. To have an idea of the empirical implications of the LRW hypothesis (plus that of the Gaussian distribution) for returns, we plot in the following figures an aggregated index of the US stock market in the 20th century together with 100 simulations describing possible alternate histories of the US market in the same period, under the hypothesis that the index evolution follows a LRW with yearly expected value and standard deviation of log return identical to the historical average and standard deviation: resp. 5.36% and 18.1% (the use of %, as usual with log returns, is quite improper, if common). Data is presented both in price scale (starting value 100) and in log price scale. The reason is simple. Consider the distribution of the log return after 100 years under our hypothesis. This is going to be the distribution of the sum of 100 iid Gaussian RVs, each with expected value 5.36% and standard deviation 18.1%. Using known results we have that this distribution shall be Gaussian with expected value 536% and standard deviation 181%. So, a standard ±2σ interval for the terminal value of this sum is 536%±362%, or, in price terms, $100e^{5.36\pm3.62}$, that is an interval with lower extreme 569 and upper extreme 794263.
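The simulation described above can be sketched as follows, using the historical parameters quoted in the text (5.36% and 18.1% per year); the seed and number of paths are, of course, arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian LRW with the parameters used in the text
mu, sigma, years, n_paths = 0.0536, 0.181, 100, 100

log_rets = rng.normal(mu, sigma, size=(n_paths, years))
terminal_log = log_rets.sum(axis=1)            # ~ N(100*mu, 100*sigma^2)
terminal_price = 100.0 * np.exp(terminal_log)  # price scale, start at 100

# The +/- 2 sigma interval for the summed log return is 5.36 +/- 3.62,
# which in price terms is hugely asymmetric (the price is lognormal)
lo = 100.0 * np.exp(5.36 - 3.62)   # about 569
hi = 100.0 * np.exp(5.36 + 3.62)   # about 794263
```

Even with only 100 paths the dispersion of `terminal_price` is enormous, which is the point made in the discussion that follows.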
This means that under our hypotheses the possible histories can be quite different. No problem in this if we recall the unconditional nature of the model. To get a quick idea: the actual evolution of the market as measured by our index gave a final value equal to about 21000 which corresponds, as said, to a sum of log returns of 536%. This is, by construction, smack in the middle of the distribution of the summed log returns and is the median of the price distribution. However, due to the exponentiation or, if you prefer, due to the power of compound interest, the distribution of final values is highly asymmetric (it is Lognormal), so that the range of possible values above the median of prices is much bigger than that below it. We only simulated 100 possible histories. Even with such a limited sample we have a top terminal price of more than 2000000 (in a very lucky, for long investors, world. We wonder what studying Finance would be in such a world...) and a bottom terminal price below 100 (again: in a world so unlucky that, had we lived in it, we likely would not talk about the stock market). Compare this with the Siegel-Shiller data we discussed in section 1, then think about the result of our simulation in such extreme worlds. For instance, with the historical mean and standard deviation of the extremely depressed version of the 20th century, the simulation I would show you in that possible world (provided you and I were still interested in this topic) would be quite different from what you see here. And all the same, this possible story is a result totally compatible (under the Gaussian LRW) with what we did actually see in our real history. Spend a little time thinking about this point. It could be "illuminating".

Think also about the economic sustainability of such extreme worlds: such extreme market behaviours cannot happen by themselves (this is not the plot of some lucky or unlucky casino guy, it is the market value of an economy, which should sustain such values, provided investors are not totally dumb), and about how they could be so absurd just because they underline the possibly absurd extreme conclusions we can derive from a simple LRW model. Last but not least, remember that all this comes from the analysis of the stock market in a, up to now, very successful country: the USA. But we analyze it so much also because it was successful (and so, for instance, most Finance schools, journals and researchers are USA based). This biases our conclusions if we wish to apply them to the rest of the world or, even, to the future of the USA. Maybe a more balanced view could be gained by comparing this result with the evolution of stock markets all around the world (this is not a new idea: Robert J. Barro, for instance, did this in "Rare Disasters and Asset Markets in the Twentieth Century", Quarterly Journal of Economics, 121(3), 2006, 823-66).

This result could be puzzling, as the "possible histories" seem very heterogeneous. This is an immediate consequence of the log random walk hypothesis. If we estimate µ and σ out of a long series of data (one century) we are using data from a very heterogeneous set of economic and historical conditions. We then use these numbers in order to simulate "possible histories" without conditioning on any particular evolution of the historical or economic variables which could and shall influence the stock price. In other words: we are using the log random walk model as a "marginal" model. That is: it is unconditional on everything you may know or suppose about the evolution of other variables connected with the evolution of the modeled stock price. This point is quite relevant if we wish to understand the sometimes surprising implications of this simple model. In the above example, according to the model and the historically estimated parameters, we get the ±2σ interval 536% ± 362% (beware the % sign: these are log returns), or, in price terms, 100·e^(5.36±3.62): that is, an interval with lower extreme 569 and upper extreme 794263 (by the way: this should be enough to understand why we should not use the term % when speaking of log returns). It must be clear that such a wide set of histories is possible, with non negligible probability, only because we assumed nothing about the (century long) evolution of all the variables that shall influence prices. Only under this "ignorance" assumption can such a heterogeneous set of trajectories have non negligible probability. If we are puzzled by the result, this is because, while the model describes the possible evolution of prices "in whatever conditions", unconditional on anything (in fact, we estimate expected return and standard deviation using a long history, during which many different things happened), when we see the implications of the model we, almost invariably, are conditioned by our recent memories and recall recent events or, unconsciously, make some hypothesis about the future, as, for instance, that economic growth shall be, on average, similar to what we have recently seen. Since the estimates of µ and σ we use (or even our assumption of zero correlation of log returns and, more in general, the structure of the model itself, which contains no other variables but a single price) are NOT conditional on such (implicit) hypotheses, it is not surprising that the model gives us much wider variation bounds than we would expect. This misunderstanding is quite common and it is to be always kept in mind when discussing results of applications of the log random walk model. There exists a wide body of literature, both from the applied and the academic sides, that suggests ways for "conditioning" the model.
This shall not be considered in this course; however, in appendix 20 we consider a simple version of a possible conditional model.

[Figure: 100 years of simulated log random walk data, 100 simulated paths (mean log return 5.35%, st. dev. 18.1%), price scale]

[Figure: 100 years of simulated log random walk data (range subset) compared with the USA stock market in the 20th century]

[Figure: 100 years of simulated log random walk data compared with the USA stock market in the 20th century, log scale]

2.1 "Stocks for the long run" and time diversification

These are very interesting and popular topics, part of the lore of the financial milieu. A short discussion shall be useful to clarify some issues connected with the LRW hypothesis, together with some implicit assumptions underlying much financial advertising. We have three flavors of the "stocks for the long run" argument. The first and the second are a priori arguments depending on the log random walk hypothesis or something equivalent to it; the third is an a posteriori argument based on historical data. It is quite important to have a clear idea of the different weight and meaning of these arguments. In fact, most of the "puzzling" statements you may find in advertising of "stocks for the long run" by the investment industry depend on a wrong "mix" of the arguments.

2.1.1 First version

The basic idea of the first version of the argument can be sketched as follows. Suppose single period (log) returns have (positive) expected value µ and variance σ². Moreover, suppose for simplicity that the investor requires a Sharpe ratio of, say, S out of his or her investment.
Under the above hypotheses, plus the log random walk hypothesis, the Sharpe ratio over n time periods is given by

nµ / (√n σ) = √n (µ/σ)

so that, if n is big enough, any required value can be reached. Another way of phrasing the same argument, when we add the hypothesis of normality of returns, is that, if we choose any probability α, the probability of the investment yielding an n period return greater than

nµ − √n z₁₋α σ

is equal to 1 − α. But this, for √n > z₁₋α σ / (2µ), is an increasing and unbounded (above) function of n, so that for any α and any chosen value C there exists an n such that, from that n onward, the probability of an n period return less than C is less than α. The investment suggestion could be: if your time horizon is an undetermined number n of years, then choose the investment that has the highest expected return per unit of standard deviation, even if the standard deviation is very high. Even if this investment may seem too risky in the "short run", there is always a time horizon such that, for that horizon, the probability of any given loss is as small as you like or, which is the same, the Sharpe ratio is as big as you like. Typically, such high return (and high volatility) investments are stocks, so: "stocks for the long run". (As an example of rather clever misunderstandings, read the Vanguard document "Time Diversification and Horizon-Based Asset Allocations", available at http://www.vanguard.com/pdf/icrtd.pdf?2210045172.) Notice, however, that the value of n for which this lower bound crosses a given C level is the solution of

nµ − √n z₁₋α σ ≥ C

In particular, for C = 0 the solution is

√n ≥ z₁₋α σ / µ

With the typical stock, the σ/µ ratio for one year is of the order of about 6. So, even allowing for a big α, so that z₁₋α is near one (check by yourself the corresponding α), the required n shall be in the range of 36, which is only slightly shorter than the average working life.
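The back-of-the-envelope computation can be checked numerically (this sketch and its example parameter values are ours): find the smallest n for which the lower bound nµ − √n·z₁₋α·σ reaches C.

```python
# find the smallest n with n*mu - sqrt(n)*z*sigma >= C (illustrative sketch)
import math

def min_horizon(mu, sigma, z, C=0.0):
    n = 1
    while n * mu - math.sqrt(n) * z * sigma < C:
        n += 1
    return n

# sigma/mu = 6 (e.g. mu = 5% and sigma = 30% per year) and z near one
print(min_horizon(0.05, 0.30, 1.0))   # -> 36
```

With a less generous α (z₁₋α = 2) the same parameters give n = 144, well beyond any working life.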
It is important to understand that, by itself, we cannot judge such a statement as correct or wrong. This investment suggestion is or is not reasonable depending on the investor's criterion of choice. This, for instance, could be the full period expected return given some probability of a given loss, or the Sharpe ratio for the full n periods or, for instance, the per period Sharpe ratio (which obviously is a constant) or, again, the absolute volatility over the full period of investment (which obviously increases without bound), and so on. For instance, a typical critique of the statement is phrased like this: "Why should we consider a given investment as proper for n time periods if we do not consider it proper for each single one of those periods?" This critique is correct if we believe that the investor takes into account the per period Sharpe ratio or some measure of probable loss and expected return per period. In other words, the critique is correct if, very reasonably, we believe the investor does not consider as equivalent investments with identical Sharpe ratios but over different time spans. Another frequent critique is: "It is true: the expected value of the investment increases without bound, but so does its volatility, so, in the end, over the long run I am, in absolute terms, much more uncertain about my investment result" (the mean to standard deviation ratio goes up only because the numerator grows faster than the denominator). This is a reasonable critique if we believe the investor decides on the basis of the absolute volatility of the investment over the full time period. We should also point out that choosing a single asset class only because, by itself, it has the highest Sharpe ratio should always be criticized on the basis of diversification arguments. In the end, acceptance or refusal, on an a priori basis, of this argument depends on how we model the investor's decision making.
However, in general, it cannot be labeled as "wrong": there may be a point in it, at least if you are a very peculiar kind of investor.

2.1.2 Second version

The second version of the argument, again based on the log random walk hypothesis, is a real fallacy (that is: it is impossible to justify it in any reasonable way): the so called "time diversification" argument. There is an enticing similarity, under the log random walk hypothesis, between an investment for one year in, say, 10 uncorrelated securities with identical expected returns and volatilities (this last hypothesis is just for simplicity: the argument can be extended to different expected returns and volatilities), and a 10 year investment in a single security with the same expected value and volatility. To be precise, in order for the result to hold we must forget the difference between linear and log returns; moreover, the comparison implicitly requires zero interest rates. But let's do it: such an "approximate" way of thinking is very common in any field where some Mathematics is used for practical purposes, and it is a sound way to proceed provided the user is able to understand the cases where his or her "approximations" do not work. In this case, the expected return and standard deviation of the return corresponding to the first strategy (which could be tagged as the "average per security" return) are µ and σ/√n, just the same as the expected value and standard deviation of the "average per year" return of the second strategy. We should be wary from the beginning of accepting such comparisons: in fact, the two investments cannot be directly compared since they are investments of the same amount but over different time periods. Moreover, and this is not independent from the previous comment, the comparison is based on the flawed idea that the expected return and variance of the first investment can be compared with the average per year expected return and variance of the second investment.
In fact, while the expected return and variance of the first investment are properties of an effective return distribution (that is, the distribution of a return which I could effectively derive from an investment), the average expected return and variance of the second investment are not properties of a return which I could derive from the second investment. All that I can derive from the second investment is the distribution of returns over the ten year period which, obviously, has ten times the expected value and ten times the variance of the distribution of the average return (which, we stress again, is not a return I could get from the investment). So, no time diversification exists, only a wrong comparison between different investments using different notions of returns. Comparable investments could be a ten year investment in the diversified portfolio and a ten year investment in the single security, and a possible correct comparison criterion could be the comparison between the ten year expected return and return variance of the two investments. However, in this case the diversified investment is seen to yield the same expected value as the undiversified investment but with one tenth of the variance, so that these two investments, now comparable, are by no means equivalent, and the single security investment is seen, in the mean variance sense, to be an inferior investment. Analogously, we could ask which investment in a single security over ten years has the same return mean and variance as the one year diversified investment. The obvious answer is: an investment of one tenth the size of the diversified investment. In other words: in order to have the same effective (that is: you can get it from an investment) return distribution, the two investments must be not only over different time periods but also of different sizes.
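The bookkeeping behind this comparison can be made concrete with a few lines (illustrative numbers of our own choosing; per-period mean µ, standard deviation σ, k uncorrelated securities, n periods, log/linear differences ignored as in the text):

```python
# Illustrative sketch of the "time diversification" comparison above.
mu, sigma, k, n = 0.05, 0.30, 10, 10

# one-period return of the equally weighted portfolio of k securities
port_mean, port_var = mu, sigma ** 2 / k

# "average per year" return of the n-period single-security investment:
# same mean and variance, but NOT a return anyone can actually cash in
avg_mean, avg_var = mu, sigma ** 2 / n

# effective n-period returns, which ARE comparable
div_mean, div_var = n * mu, n * sigma ** 2 / k   # diversified, held n periods
one_mean, one_var = n * mu, n * sigma ** 2       # single security, n periods

print(div_mean == one_mean, one_var / div_var)   # same mean, k times the variance
```

The superficial similarity (`port_var == avg_var`) is what the fallacy trades on; the attainable ten-year returns have the same mean but a variance ratio of k = 10.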
While the first version of the argument could be argued for, at least under some hypothetical, maybe unlikely but coherent, setting, this second version of the argument is a true fallacy.

2.1.3 Third version

The third version of the stocks for the long run argument is the soundest, as it can be argued for without relying on unlikely assumptions or committing blatant logical errors. It is to be noticed that this third version is not an a priori argument, based on assumptions concerning the stochastic behavior of prices and the decision model of agents (and, maybe, some logical error). Instead, it is an "a posteriori" or "historical" version of the argument. As such, its acceptance or rejection entirely depends on the way we study historical data. In short, this argument states that, based on the analysis of historical prices, stocks were always, or at least quite frequently, a good long run investment. Being a historical argument, even if true (and here is not the place to argue for or against this point), it does not imply that the past behavior will be replicated in the future. The following figure from "Stocks for the long run" summarizes Siegel's argument: on the horizontal axis we see the holding periods; on the vertical axis the up and down bars plot the maximum and minimum holding period total return (in real terms) for all possible holding periods of the same length. This is plotted for stocks, bonds (medium/long term govies) and t-bills (short term govies). Returns are expressed in average per year terms. As can be seen (and as is quite obvious), the range between best and worst average returns shrinks as the holding period increases. What is less obvious is that the worst mean return for stocks is less negative than the corresponding value for bonds and bills from the 13 year holding period onward, while the best mean return is always higher.
Actually, the figure could be misleading, as it compares returns for different securities over different time periods. Much more interesting would be a figure which compares the spreads over the same periods of investments in stocks vs bonds, bonds vs bills and stocks vs bills and displays, for these three kinds of investments, the best and worst average result for each holding period. Some help can be derived from another figure in "Stocks for the long run". In this picture we see the standard deviation of the average return for different holding periods (this time, supposedly, non overlapping) of investments in the three securities. These are compared with the theoretical standard deviation for the investment in stocks derived from the hypothesis of iid log returns (our log random walk). As commonly observed, the actual standard deviation of mean returns decreases faster than 1/√n, the order implied by the random walk hypothesis, and this may imply a slight negative correlation between returns which becomes relevant for long holding periods. Both these are empirical facts. Many different interpretations are possible and, as already stated above, it is perhaps better to consider them just as historical facts, without trying to deduce "general rules" or "models" which, while easily adapted to past data, would most likely be very fragile statements about possible futures. While apparently held by the majority of financial journalists (provided they do not weight too much, say, the last 30 years of prices in Japan or the last 10 to 15 years for most of the rest of the world), broadly popular in trouble free times (at least as popular as the, historically false, argument about real estate as the surest, if not the best, investment), and so quite popular for most time periods, at least in the USA and during the first thirty and the last fifty years of the past century, this argument is quite controversial among researchers.
The two very famous and quite readable books we quoted in the chapter about returns, Robert Shiller's "Irrational Exuberance" vs Jeremy Siegel's "Stocks for the Long Run", share (sic!) opposite views on the topic (derived, as we hinted at but do not have the time to fully discuss, from different readings of the same data). While this is not the place for discussing the point, we would suggest that the reader, just for the sake of amusement, consider a basic fault of such "in the long run it was ..." arguments. We have here a typical example of a case where the very fact of considering the argument, or even the existence of the phenomenon to which the argument applies, depends on the fact that the phenomenon happened, that is: something "was good in the long run". In fact, we could doubt the possibility for an institution (the stock market) which has existed in its modern form, at least in the USA, since, say, the second half of the nineteenth century, to survive up to today without at least giving a sustainable impression of offering some opportunities. Such arguments, if not accompanied by something else to sustain them, become somewhat empty: they are analogous to being surprised that the dish I most frequently eat is also among those I like the most or, more extreme, that old people did not die young or, again, that when we are in one of many queues, we spend most of our time in the slowest queue. Sometimes, however, the "opportunity" of some institution, and how to connect this with its survival, can manifest in strange, revealing ways. For instance, games of chance have existed from time immemorial with the only "long run" property of making the bank holder richer, together with the occasional random lucky player. The overall population of players is made, as a whole, poorer.
So, while it is clear here what the "opportunity" of this institution is (both the, usually, steady enrichment of the bank holder and the available, albeit unlikely, hope of a quick enrichment), the survival of such an institution based on such opportunities tells us something interesting about man's mind. We shall get into this topic time and again in what follows (while we won't be able to analyze it in full). This should not puzzle the reader, as it is the bitter bread and butter of any research field where we decide to use Probability and Statistics for writing and testing models, but only observational data are available and no (relevant) experiments are possible. Let us mention some of these fields: evolutionary biology, cosmology, astronomy. In all these fields we are overflowed with data (as in Finance and Economics) but data does not come from experiments and, most important, the observer is part of the dataset and observation is not independent of what is observed. A possible alternative, actually chosen by similar fields like history, is to abandon, or not even consider as a serious possibility, the writing of models in the language of Probability and the testing of these with Statistics. In such fields, Statistics is still used: not as a tool for testing models but as a tool for describing historical data. Fields like political science and sociology are divided in their attitude. If we like fringe movements: there exists a minority of historians, mostly inspired by "Chicago area" new economic history or cliometrics, publishing in Economics journals and not very well considered by mainstream historians, who try to deal with historical problems using Probability and Statistics (usually adapting models from Economics).
On the other side, a not small number of economists believe that the mainstream attitude to Economics shows an excess in the use of such tools and state that there exists useful economic knowledge which cannot be expressed in any available mathematical/probabilistic language. In extreme cases the statement is made according to which only irrelevant points of Economics can be described with such tools.

2.2 *Some further considerations about log and linear returns

The log return has many uses beyond the fact that it sums over time. Consider the following "game". Your initial wealth is W0. At each time t you flip a coin with probability P of head and 1 − P of tail. If head comes up, W_{t+1} = W_t u, otherwise W_{t+1} = W_t d. To fix ideas, set P = 0.5, u = 1.5 and d = 0.6. This seems a good game, at least at first sight. Compute E(W1/W0 | W0): this is equal to (u + d)/2 = 1.05. Now compute

E(W2/W0 | W0) = (uu + ud + du + dd)/4 = (2.25 + 0.9 + 0.9 + 0.36)/4 = 1.1025

In general we have

E(Wn/W0 | W0) = E(u^(nu) d^(n−nu)) = d^n E((u/d)^(nu))

where nu is the (random) number of heads (in n flips). Since nu is Binomial(n, P), we can approximate its distribution with a N(nP, nP(1 − P)). Write now (u/d)^(nu) = exp(nu ln(u/d)). Using the CLT approximation, nu ln(u/d) is (approximately) distributed according to N(nP ln(u/d), nP(1 − P) ln(u/d)²), so that exp(nu ln(u/d)) is Log-normal with expected value exp(nP ln(u/d) + (1/2) nP(1 − P) ln(u/d)²). Putting this together we find:

E(Wn/W0 | W0) ≈ d^n exp(nP ln(u/d) + (1/2) nP(1 − P) ln(u/d)²) = exp(nP ln(u/d) + (1/2) nP(1 − P) ln(u/d)² + n ln(d))

With our numbers this is exp(n · 0.05229) > 1 and unbounded. The approximation is quite good even for n = 1 (1.05368) and n = 2 (1.11024). Notice that, with this computation, we are considering the "average across trajectories", that is, e.g.
for n = 2, we have 4 possible trajectories, with final results W0 times 2.25, 0.9, 0.9, 0.36, and we are considering the expected value across these. (For some short remarks about the debate on Mathematics and Economics see the appendix on pag. 205.) Let us look at the game in a different way.

Wn/W0 = (u/d)^(nu) d^n = exp(ru nu + rd (n − nu))

where ru = ln(u) and rd = ln(d) are the log returns corresponding to u, d. Log returns sum over time. We have, then,

E(ln(Wn/W0) | W0) = E(ru nu + rd (n − nu)) = n (ru P + rd (1 − P))

With our numbers, E(ln(Wn/W0) | W0) = −0.05268·n. The fact that the two results are different is, by itself, not striking: the first expected value, positive and unbounded, is the "arithmetic average" of the possible linear returns across different "histories"; the second is the average of the possible log returns of the same. We could agree that, since what has monetary meaning is not the log but the linear return, we should only consider the positive and not the negative result. Before commenting further on this, however, there is a second, quite interesting result. Take a story of length n; at the end of this story you have more money than at the beginning iff

Wn/W0 = (u/d)^(nu) d^n > 1

Let us compute the probability of this event:

P((u/d)^(nu) d^n > 1) = P(nu ln(u/d) + n ln(d) > 0) = P(nu/n > −ln(d)/ln(u/d))

Here, again, we use the Gaussian approximation and recall that, approximately, nu/n ≈ N(P, P(1 − P)/n). With this we have:

P(nu/n > −ln(d)/ln(u/d)) = P(√n (nu/n − P)/√(P(1 − P)) > √n (−ln(d)/ln(u/d) − P)/√(P(1 − P))) ≈ 1 − Φ₀,₁(√n (−ln(d)/ln(u/d) − P)/√(P(1 − P)))

Using our numbers this becomes 1 − Φ₀,₁(√n · 0.1148), and this goes to 0 as n grows. In words: the probability of being on a trajectory where I "win at n" goes to 0 as n increases. The bigger n, the less likely that, if I randomly choose a trajectory of length n, I win. Let us summarize:

1. In this game, if n is big enough, I am almost sure to lose.

2.
However, if I take the expected value of the percentage increase of wealth (linear return), from the beginning to n, across trajectories, this is positive and increasing unboundedly with n.

How is this possible? The answer is simple: on most trajectories you lose (you see this even for n = 2: you lose on 3 out of 4 trajectories). However, IF YOU WIN YOU MAY WIN BIG: there are trajectories which have probability going to 0, but along which you win big. Since the percentage amount won along these few lucky trajectories grows with n much faster than their probability decreases, if you compute the arithmetic mean of the possible terminal results this is positive and unbounded. If a big population plays this game and the coin toss results are independent across different players, we shall see that most players are going to lose their money, while a few (the fewer the bigger n, that is, the longer the game) win big. If the game is the sole source of income of this population, we should observe a growing and extreme concentration of wealth. If the population of gamblers is finite, however, for big n the probability of observing even just 1 winner goes to 0. If the size of the population is constant and equal to N, the expected number of winners, N·P(nu/n > −ln(d)/ln(u/d)), quickly becomes smaller than 1. The expected log return, on the other hand, is negative and unbounded below because it only grows linearly with the number of wins, and not exponentially like the linear return. The trajectories on which you "win" are the same whether you measure your winnings in log return or in linear return terms. Obviously, the probability of each trajectory is also the same, as it only depends on the number of heads and tails, not on how you compute the return. However, as we know, the linear return is always bigger than the log return, except in the case of zero return.
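The two summary points can be checked exactly, without any Gaussian approximation (this sketch is ours; only the parameters P = 0.5, u = 1.5, d = 0.6 come from the text):

```python
# Exact computations for the coin game: the expected linear growth diverges
# while the probability of ending above the initial wealth vanishes.
import math
from math import comb

P, u, d = 0.5, 1.5, 0.6

def expected_linear(n):
    # exact expected linear growth: E(W_n/W_0) = (P*u + (1-P)*d)^n = 1.05^n
    return (P * u + (1 - P) * d) ** n

def win_prob(n):
    # exact probability that W_n > W_0: need n_u*ln(u) + (n - n_u)*ln(d) > 0
    return sum(comb(n, k) * P**k * (1 - P)**(n - k)
               for k in range(n + 1)
               if k * math.log(u) + (n - k) * math.log(d) > 0)

print(0.5 * (math.log(u) + math.log(d)))   # per-period expected log return, about -0.05268
for n in (10, 50, 200):
    print(n, expected_linear(n), win_prob(n))
# the expectation grows without bound while the win probability shrinks
```

Multiplying `win_prob(n)` by a population size N gives the expected number of winners discussed above.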
This means that, on positive trajectories, the linear return shall grow much faster, and on negative trajectories it shall fall more slowly, than the log return. This difference is such that the same (trajectory) probability, multiplied by the linear return of the trajectory and summed over trajectories, shall be positive and unbounded while, multiplied by the log return of the same trajectory (that is: the log of one plus the linear return) and summed over trajectories, shall be negative and unbounded. Would you play this game? If you are risk neutral you should: you are only interested in the expected "monetary" (linear) return; you are not worried by the fact that the most likely result is that you lose. If you are risk averse (e.g. you have a log utility) you should not. Historical note: this example is a reworked version of the famous "St. Petersburg paradox" by Nicholas Bernoulli, considered and "solved" (it would be better to say: "understood") by his cousin Daniel Bernoulli while working for the Czar in St. Petersburg. The result is a "paradox" only in the sense that it makes clear the implications of being risk neutral. If we are puzzled by these implications, we can deduce that we are NOT, by instinct, risk neutral. We do not think it reasonable to willingly play a game where we are almost sure to lose (for big n) just because the expected win of the game is positive (and even unbounded). The "solution" by Daniel Bernoulli is the understanding of this: we are puzzled by the "paradox" because we are not risk neutral "by instinct". If you evaluate your win, still using the expected value, not in linear return terms (a win of k times the original sum is evaluated k) but in log return terms (a win of k times the original sum is evaluated ln(k)), you introduce a "risk averse" utility function and would not play the game. But there is more: once this is understood, a deeper "paradox" becomes clear: as computed above, the expected value of Wn/W0 is greater than 1 and unbounded.
On the other hand, the probability of being on a non losing trajectory tends to 0. This means, as stated above, that the probability that in a finite population of players at least one player wins tends to zero. If we put these three statements together, we immediately understand that, even if the expected value of Wn/W0 is greater than 1 and unbounded, with probability tending to 1 as n increases all players shall be losers. This means that for almost all players the realized Wn/W0 shall be less than 1. Suppose, now, that the players do not know the expected value of the game and need to estimate it. After all, we can evaluate the expected value because we suppose we "know" the probabilities. Almost all players and, in the limit, a set of players of probability one, shall observe Wn/W0 < 1: why should they believe that the expected value of this is positive and increasing with n? This example shall make clear the difference between: 1) computing the expected value of the game using ALL possible (in the limit, infinite) trajectories, each with its probability; 2) computing the average result of a FINITE set of players. In 1) we consider all trajectories. Some of these, a set of small and decreasing (with n increasing) probability, imply enormous winnings which "balance" the much more likely losses. In 2) we only have a finite population and, for big n, the probability of observing even a single "winner" in this population goes to 0. Last point: the game we discussed is by no means strange. It is the classic recombining (multiplicative) binomial tree presented in basic option pricing courses.

Examples: Exercise 2 - IBM random walk.xls

3 Volatility estimation

In applied Finance the term "volatility" has many connected meanings. We mention here just the main three:

1. Volatility may simply mean the attitude of market prices, rates, returns etc. to change in an unpredictable and unjustified manner.
This is without connection to any formal definition of "change", "unpredictable" or "unjustified". Here volatility is tantamount to chance, luck, destiny, etc. Usually the term has a negative undertone and is mainly used in bear markets. In bullish markets the term is not frequently used and is typically replaced by a more "positive" synonym: a volatile bull market is "exuberant", "tonic" or "lively".

2. More formally, and mostly for risk managers, volatility has something to do with the standard deviation of returns and, sometimes, is estimated using historical data (hence the name "Historical Volatility").

3. For derivative traders, and frequently for risk managers, "volatility" is the name of one (or more) parameters in derivative models which, under the hypotheses that make the models "true", are connected with the standard deviation of underlying variables. However, in the understanding that these hypotheses are never valid in practice, such parameters are not estimated from historical data on the underlying variables (say, using time series of stock returns) but are backed out from quoted prices of derivatives, using the pricing model as a fitting formula. This is in accord with the strange, but widely held and, in fact, formally justifiable, notion that models may be useful even if the hypotheses underlying them are false. This is "Implied Volatility".

In what follows we shall introduce a standard and widely applied method for estimating volatility on the basis of historical data on returns; that is, we consider the second meaning of volatility. Under the LRW hypothesis a sensible estimate of σ² is:

Σ_{i=0,...,n} (r*_{t−i} − r̄*)² / n

where r̄* is the sample mean of the log returns.
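This estimate can be sketched in a few lines (our own code and made-up numbers; the 256 trading-day convention used for annualizing is the one adopted later in the text):

```python
# Equal-weight variance estimate: n + 1 log returns, denominator n.
import math

def variance_estimate(returns):
    n = len(returns) - 1                 # n + 1 observations
    mean = sum(returns) / len(returns)   # the sample mean
    return sum((r - mean) ** 2 for r in returns) / n

rets = [0.01, -0.02, 0.015, 0.0, -0.005]   # made-up daily log returns
var_daily = variance_estimate(rets)
print(var_daily)
print(math.sqrt(256 * var_daily))   # annualized volatility under the LRW
```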
This is the standard unbiased estimate for the variance of uncorrelated random variables with identical expected values and variances (the simple empirical variance of the data, where the denominator is taken as the actual number of observations n + 1, could be used without problems, as in standard applications the sample size is quite big). Notice that each data point is given the same weight: the hypothesis is such that any new observation should improve the estimate in the same way. The log random walk would justify such an estimate.

In practice nobody uses this estimate and a common choice is the exponential smoothing estimate which, while already quite old when suggested by J. P. Morgan in the RiskMetrics context, is commonly known in the field as the RiskMetrics estimate:

$$V_t = \frac{\sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-i}}{\sum_{i=0,\dots,n} \lambda^i}$$

From a statistician's point of view this is an exponentially smoothed estimate with λ a smoothing parameter: 0 < λ < 1. Common values of the smoothing parameter are around 0.95.

Users of such an estimate do not consider it sensible to regard each data point as equally relevant: old observations are less relevant than new ones. Implicitly, then, while we "believe" the log random walk when "annualizing" volatility, we do not believe it when estimating volatility.

Moreover it shall be noticed that, in this estimate, the sampling mean of returns does not appear. This is a choice which can be justified in two ways. First, we can assume the expected return µ over a small time interval to be very small. With a non negligible variance it is quite likely that an estimate of the expected value of returns could show a higher sampling variability than its likely size, and so it could create problems for the statistical stability of the variance estimate^15. Second, an estimate of the variance where the expected value is set to 0 tends to overestimate, not to underestimate, the variance (remember that variance equals the mean of squares less the squared mean:
if you set the second to 0 you exaggerate the estimate). For institutional investors, traditionally long the market, this could be seen as a conservative estimate. Obviously this may not be a reasonable choice for hedged investors and derivative traders.

A simple "back of the envelope" computation: say the standard deviation of stock returns over one year is in the range of 30%. Even in the simple case where data on returns are i.i.d., if we estimate the expected return over one year with the sample mean, we need about 30 observations (years!) in order to reduce the sampling standard deviation of the mean to about 5.5%, so as to be able to estimate reliably risk premia (this is financial jargon: the expected value of return is commonly called 'risk premium', implying some kind of APT, even if it also contains the risk free rate) of the size of at least (usual 2σ rule) 8%-10% per year (quite big indeed!). Notice that things do not improve if we use monthly or weekly or daily data (why?). It is clear that any direct approach to the estimate of risk premia is doomed to failure. A connected argument shall be considered at the end of this chapter.

15 The apparent truncation at n should be briefly commented on. As we have just seen, the standard estimate should be based on the full set of available observations. This could be applied as a convention also to the RiskMetrics estimate. On the other hand, consider the fact that, e.g., λ = 0.95 raised to the power of 256 (conventionally one year of daily data) is less than 0.000002. So, at least with daily data, to truncate n after one year of data (or even before) is substantially the same as considering the full data set.
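Both back-of-the-envelope numbers above (the 5.5% sampling standard deviation of the mean and the 0.000002 bound for one year of smoothing weights) are immediate to verify; a quick sketch:

```python
import math

# sampling sd of the mean of 30 yearly returns with sigma = 30%: sigma/sqrt(n)
se_mean_30y = 0.30 / math.sqrt(30)   # about 0.055, i.e. 5.5% per year

# weight factor left on data older than one year when lambda = 0.95,
# with 256 conventional daily observations per year
tail_weight = 0.95 ** 256            # below 0.000002
```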
As is well known:

$$\sum_{i=0}^{N} \lambda^i = \frac{1-\lambda^{N+1}}{1-\lambda}$$

So that (for 0 < λ < 1)

$$\sum_{i=0,\dots,\infty} \lambda^i = \frac{1}{1-\lambda}$$

We can then approximate the V_t estimate as:

$$V_t = (1-\lambda) \sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-i}$$

In order to understand the meaning of this estimate it is useful to write it in recursive form (this is also useful for computational purposes). We can directly check that:

$$V_t = \lambda V_{t-1} + \frac{r^{*2}_{t}}{\sum_{i=0,\dots,n}\lambda^i} - \frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n}\lambda^i}$$

In fact, since

$$V_{t-1} = \frac{\sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n}\lambda^i}$$

We have

$$\lambda V_{t-1} + \frac{r^{*2}_{t}}{\sum_{i}\lambda^i} - \frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i}\lambda^i} = \frac{\sum_{i=0,\dots,n} \lambda^{i+1} r^{*2}_{t-1-i}}{\sum_{i}\lambda^i} + \frac{r^{*2}_{t}}{\sum_{i}\lambda^i} - \frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i}\lambda^i} =$$

$$= \frac{r^{*2}_t + \sum_{i=0,\dots,n-1}\lambda^{i+1} r^{*2}_{t-1-i}}{\sum_{i}\lambda^i} = \frac{\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-i}}{\sum_{i}\lambda^i}$$

Which is the definition of V_t.

For the standard range of values of λ and n the last term can be approximated with 0^16. Using the approximate value of the denominator we have:

$$V_t = \lambda V_{t-1} + (1-\lambda) r^{*2}_t$$

In practice, the new estimate V_t is a weighted mean of the old estimate V_{t-1} (weight λ, usually big) and of the latest squared log return (weight 1 − λ, usually small). A simple consequence of this (and of the fact that the estimate does not consider the mean return) is the following. Since the squared return is always non negative and λ is usually near one, even if the new return is 0, V_t is still going to be equal to λV_{t-1}: the estimated variance can decrease at each step by at most a fraction 1 − λ. On the other hand, it can increase, in principle, by any amount when abnormally big squared returns are observed. This implies an asymmetric behavior: any shock introduces an abrupt jump in V_t, while a following sequence of "normal" values for returns shall reduce the estimated value only in a smoothed way, the faster the smaller λ is.
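The recursion is easy to code; a minimal sketch (the function names are mine), checking numerically that the recursive update and the direct weighted sum agree once λⁿ is negligible:

```python
import random

def ewma_variance(returns, lam=0.95):
    """Recursive RiskMetrics-style update V_t = lam*V_{t-1} + (1-lam)*r_t**2,
    seeded with the first squared return."""
    v = returns[0] ** 2
    for r in returns[1:]:
        v = lam * v + (1.0 - lam) * r ** 2
    return v

def ewma_variance_direct(returns, lam=0.95):
    """Direct weighted form: sum_i lam^i r_{t-i}^2 / sum_i lam^i
    (i = 0 is the most recent observation)."""
    num = den = 0.0
    for i, r in enumerate(reversed(returns)):
        num += lam ** i * r ** 2
        den += lam ** i
    return num / den

random.seed(0)
rets = [random.gauss(0.0, 0.01) for _ in range(2000)]  # fake daily log returns
v_rec = ewma_variance(rets)
v_dir = ewma_variance_direct(rets)
```

With 2000 daily observations and λ = 0.95, the truncation term λⁿ⁺¹ is astronomically small, so the two forms coincide to machine precision.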
The reader should remember that this behavior of estimated volatility is purely a feature of the formula used for the estimate.

The use of such an estimate of σ² implies a disagreement with the standard version of the LRW hypothesis, as described above, as it implies a time evolution of the variance of returns. The recursive formula:

$$V_t = \lambda V_{t-1} + (1-\lambda) r^{*2}_t$$

is the empirical analogue of an autoregressive model for the variance of returns of the kind:

$$\sigma^2_t = \gamma \sigma^2_{t-1} + \epsilon^2_t$$

which is a particular case of a class of dynamic models for conditional volatility (ARCH: Auto Regressive Conditional Heteroskedastic models) of considerable fortune in the econometric literature.

The above discussion, involving the smoothed estimate of the return variance, is by no means just a fancy theoretical analysis or a curiosity related to RiskMetrics. It is the basis of current regulations. Here I reproduce a paragraph of the EBA (European Banking Authority) paper EBA/CP/2015/27.

16 In fact $\lambda^{n+1} r^{*2}_{t-n-1} / \sum_{i=0,\dots,n}\lambda^i \simeq (1-\lambda)\lambda^{n+1} r^{*2}_{t-n-1}$ and, for 0 < λ < 1, big n and any squared return "not too big", this shall be approximately 0.

Article 38 Observation period

1. Where competent authorities verify that the VaR numbers are computed using an effective historical observation period of at least one year, in accordance with point (d) of Article 365(1) of Regulation (EU) No 575/2013, competent authorities shall verify that a minimum of 250 business days is used. Where institutions use a weighting scheme in calculating their VaR, competent authorities shall verify that the weighted average time lag of the individual observations is not less than 125 business days.

2.
Where, according to point (d) of Article 365(1) of Regulation (EU) No 575/2013 the calculation of the VaR is subject to an effective historical observation period of less than one year, competent authorities shall verify that the institution has in place procedures to ensure that the application of a shorter period results in daily VaR numbers greater than daily VaR numbers computed using an effective historical observation period of at least one year. The quoted Article 365(1) of Regulation (EU) No 575/2013 (On prudential requirements for credit institutions and investment firms), is as follows: Article 365 VaR and stressed VaR Calculation 1. The calculation of the value-at-risk number referred to in Article 364 shall be subject to the following requirements: (a) daily calculation of the value-at-risk number; (b) a 99th percentile, one-tailed confidence interval; (c) a 10-day holding period; (d) an effective historical observation period of at least one year except where a shorter observation period is justified by a significant upsurge in price volatility; (e) at least monthly data set updates. The institution may use value-at-risk numbers calculated according to shorter holding periods than 10 days scaled up to 10 days by an appropriate methodology that is reviewed periodically. 2. In addition, the institution shall at least weekly calculate a “stressed value-at-risk” of the current portfolio, in accordance with the requirements set out in the first paragraph, with value-at-risk model inputs calibrated to historical data from a continuous 12-month period of significant financial stress relevant to the institution’s portfolio. The choice of such historical data shall be subject to at least annual review by the institution, which shall notify the outcome to the competent authorities. 
EBA shall monitor the range of practices for calculating stressed value at risk and shall, in accordance with Article 16 of Regulation (EU) No 1093/2010, issue guidelines on such practices.

The meaning of this rule is that, if you use the exponentially smoothed estimate truncated at N, so that your (daily data) weights are

$$w_{t-i} = \frac{\lambda^i}{\sum_{j=0}^{N}\lambda^j} = \lambda^i \frac{1-\lambda}{1-\lambda^{N+1}}$$

it must be that the "weighted average time lag of the individual observations", that is:

$$\sum_{i=0}^{N} i\, w_{t-i}$$

be at least 125 (days). This, for given N, requires a specific choice of λ. Notice that if N = 250 the only possible choice is λ = 1. In order to decrease λ and respect the rule you must increase N^17. Since for N → ∞ the weighted average time lag is λ/(1 − λ), the requirement asks in any case (that is: whatever N may be) for a value λ > 125/126 = .992063. An even bigger number shall be needed for moderate N. This is much bigger than what used to be the common case in the past. The examples in the "classic" edition of the RiskMetrics Technical Document (iv edition, 1996) use λ = .94 which, even with very big N, corresponds to a weighted average time lag of .94/.06 = 15.67, by far too small according to the new rules.

3.1 Is it easier to estimate µ or σ²?

It is useful to end this small chapter discussing a widely held belief, supported by some empirical results, according to which the estimation of variances (and to a lesser degree of covariances) is an easier task than the estimation of expected returns, at least in the sense that the percentage standard error of the estimate shall be smaller than in the case of expected return estimation. The educated heuristics underlying such a belief are as follows^18. Consider log returns from a typical stock, let them be iid with expected value (on a yearly basis) of .07 and standard deviation .3.
The usual estimate of the expected value, that is the arithmetic mean, shall be unbiased and with a sampling standard deviation of .3/√n, where n is the number of years used in the estimation. Hence the t-ratio, that is the ratio of the estimate to its standard error, under these hypotheses shall be roughly √n/4. Hence, for a t-ratio of 2 we need √n = 8, that is n = 64 (years!). If we want a standard error equal to 1/2 the estimate we need √n = 16 and n = 256. 256 years of data for a 2σ confidence interval that still implies a possible error of 50% in the estimate of µ. This simple back of the envelope computation explains why we know so little about expected returns: if our a priori are correct then it is very difficult to estimate them.

There could be a way out. Do not use yearly data but, say, monthly data. Alas, for log returns and under the log random walk this does not work. Keep n constant and use any k sub periods per year (of length 1/k in yearly terms), such that the number of observations in n years (for returns over the sub periods) is kn. The strategy could be that of estimating the sub period expected value µ_k = µ/k (the equality is due to the log random walk hypothesis) and then getting an estimate of the yearly expected value by multiplying the sub period estimate by k.

17 This can easily be done with Excel or similar. There is also a partially explicit solution. In fact, using some algebra, we see that the required "weighted average time lag" is

$$E(i) = \sum_{i=0}^{N} i\,\lambda^i \frac{1-\lambda}{1-\lambda^{N+1}} = \frac{\lambda}{1-\lambda} - \frac{(N+1)\lambda^{N+1}}{1-\lambda^{N+1}}$$

The problem becomes that of choosing λ and N such that this value is at least equal to 125.

18 This point is discussed in many papers and book chapters. Among the most illustrious examples, see Appendix A in: Merton, R.C., 1980. "On estimating the expected return on the market: an exploratory investigation". J. Financ. Econ. 8, 323–361.
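The search for admissible (λ, N) pairs mentioned in the footnote on the weighted average time lag is easy to automate; a minimal sketch (the function names are mine):

```python
def avg_time_lag(lam, N):
    """Weighted average time lag sum_{i=0}^{N} i * w_i with exponential
    weights w_i = lam**i / sum_j lam**j (i = 0 is the most recent day)."""
    den = sum(lam ** i for i in range(N + 1))
    return sum(i * lam ** i for i in range(N + 1)) / den

def avg_time_lag_closed(lam, N):
    """Partially explicit solution from the footnote:
    lam/(1-lam) - (N+1)*lam**(N+1)/(1-lam**(N+1))."""
    return lam / (1.0 - lam) - (N + 1) * lam ** (N + 1) / (1.0 - lam ** (N + 1))

lag_94 = avg_time_lag(0.94, 10000)    # classic RiskMetrics lambda: about 15.67 days
lag_994 = avg_time_lag(0.994, 10000)  # a lambda compatible with the 125-day rule
```

With λ = .94 the lag stays around 15.67 days however big N is, far below the 125-day floor; λ = .994 (above 125/126) clears it for big N.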
If we indicate with r*_{ki} the log returns for the sub periods, with σ²_k = σ²/k as variance, we would have:

$$V(\hat\mu_k) = V\left(\sum_{i=1}^{kn} r^*_{ki}/kn\right) = \sigma^2_k/kn = \sigma^2/k^2 n$$

This seems much better than before, but it is an illusion: we do not need an estimate of µ_k, we need an estimate of µ = kµ_k. We must then compute V(kµ̂_k) and this is

$$V(k\hat\mu_k) = k^2 V(\hat\mu_k) = \sigma^2/n$$

Exactly the same as with "aggregated" data. This should not surprise us: in fact the arithmetic mean of log returns is simply given by the log of the ratio of the last to the first price, divided by the required number of data points. In other words, it only changes because of the denominator: n for a yearly mean and kn for sub periods of length 1/k. No information is added by using sub period data, hence no improvement in the variance of the estimate.

In summary: the expected return is difficult to estimate for two reasons. First: σ is expected to be much bigger than µ and the t-ratio depends on the ratio of these. Second: even if we increase the frequency of observations nothing changes for the estimate of the (yearly) µ, so that its sampling variance stays the same.

Now let us do a similar analysis for the variance. In order to make things simple we shall suppose that µ is known and data are Gaussian. This allows us to find quickly some useful results. The general case (unknown µ) is given below, but nothing relevant intervenes when we remove the two simplifying hypotheses. At the end of this section we shall also consider the smoothed estimate.

Let us compute the sampling variance of our variance estimate (known µ) and let r*_i be the yearly log return:

$$V(\hat\sigma^2) = V\left(\frac{1}{n}\sum_{i=1}^{n}(r^*_i-\mu)^2\right) = \frac{1}{n} V\left((r^*-\mu)^2\right) = \frac{1}{n}\left(E\left((r^*-\mu)^4\right) - E\left((r^*-\mu)^2\right)^2\right) = \frac{1}{n}\left(\mu_{c4} - \sigma^4\right)$$

Where µ_{c4} = E((r* − µ)⁴) is the fourth centered moment, which without further hypotheses could be any non negative constant.
If the r*_i are Gaussian we have µ_{c4} = 3σ⁴ and the resulting variance of the sampling variance is

$$V\left(\frac{\sum_{i}(r^*_i-\mu)^2}{n}\right) = \frac{2}{n}\sigma^4$$

So that the sampling standard deviation of the estimated variance shall be, with our numbers, √2σ²/√n. The t-ratio for this estimate shall be (in the approximation where we suppose the estimate and the true variance not to differ too much)

$$\frac{\hat\sigma^2}{\sigma^2}\frac{\sqrt n}{\sqrt 2} \approx .7\sqrt n$$

In order to get a t-ratio equal to 2 we need √n > 2/.7 and we get there with just n = 9 instead of 64 as in the expected value case. For a t-ratio of 4 or greater we need √n > 4/.7 and for this n = 33 suffices (instead of n = 256 for the above discussed case of the expected value).

But there is much more: for estimating the variance the use of higher frequency data improves the result. Let our strategy be that of estimating the yearly variance as k times the estimated variance for a sub period of length 1/k in yearly terms (the prime sign is to indicate that this is a new estimate):

$$\hat\sigma'^2 = k\hat\sigma^2_k$$

Using the same notation and hypotheses as above we get

$$V(\hat\sigma^2_k) = V\left(\frac{1}{kn}\sum_{i=1}^{kn}(r^*_{ki}-\mu_k)^2\right) = \frac{1}{kn}V\left((r^*_k-\mu_k)^2\right) = \frac{1}{kn}\left(\mu_{kc4}-\sigma_k^4\right) = \frac{2}{kn}\sigma_k^4 = \frac{2}{kn}\frac{\sigma^4}{k^2}$$

where we used the Gaussian hypothesis. Then

$$V(\hat\sigma'^2) = k^2 V(\hat\sigma^2_k) = \frac{2}{kn}\sigma^4$$

And we see that now k is in the formula: using sub period data improves the estimate. Now the t-ratio for the variance estimate is approximately equal to

$$\frac{\hat\sigma'^2}{\sigma^2}\frac{\sqrt{kn}}{\sqrt 2} \approx .7\sqrt{kn}$$

So that the use of k sub periods per year has an effect identical to that of multiplying the number of years by k. With, say, monthly data, we need less than one year (actually 9 months) for a t-ratio of 2 and slightly less than 3 years (just 33 months) for a t-ratio greater than 4, instead of the 33 years needed with yearly data^19.
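The two punchlines (sub period data do not help for µ but do help dramatically for σ²) can be checked by simulation; a rough Monte Carlo sketch, with the text's yearly parameters (all names and the number of replications are my own choices):

```python
import math
import random

random.seed(7)
MU, SIGMA, YEARS = 0.07, 0.30, 10   # yearly parameters, as in the text

def sd(v):
    """Sample standard deviation of a list."""
    a = sum(v) / len(v)
    return math.sqrt(sum((x - a) ** 2 for x in v) / (len(v) - 1))

def estimate_spread(k, n_rep=2000):
    """Simulate n_rep independent histories of YEARS years observed k times
    per year; return the sd across histories of the yearly-mu estimate
    (k * sub-period mean) and of the yearly-variance estimate
    (k * sub-period sample variance)."""
    mu_hats, var_hats = [], []
    for _ in range(n_rep):
        r = [random.gauss(MU / k, SIGMA / math.sqrt(k)) for _ in range(k * YEARS)]
        m = sum(r) / len(r)
        mu_hats.append(k * m)
        var_hats.append(k * sum((x - m) ** 2 for x in r) / (len(r) - 1))
    return sd(mu_hats), sd(var_hats)

sd_mu_1, sd_var_1 = estimate_spread(k=1)     # yearly observations
sd_mu_12, sd_var_12 = estimate_spread(k=12)  # monthly observations
```

The spread of the µ estimate stays near σ/√n ≈ 0.095 at both frequencies, while the spread of the variance estimate shrinks by roughly √12 with monthly data.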
19 We see that if we decrease the observation interval, so that the frequency of observations per unit period k increases, in the limit we get a sampling standard deviation of the variance equal to zero. This should not be taken too seriously: the log random walk model, which underlies this result, may be a good approximation for time intervals which are both not too long and not too short. Below the 1 day horizon we enter the world of intraday, trade by trade data which cannot be summarized in the simple log random walk hypothesis.

Estimating σ² is then easier than estimating µ for two reasons. First: the ratio (sometimes called t-ratio) between the estimate and its standard deviation is bigger than that for the expected value, whatever n may be. This comes from the empirical, and theoretical, idea that expected returns are much smaller than volatilities. Second: even if the first reason were not true, we would still have the fact that using higher frequency data improves (dramatically) the quality of the estimate for σ² while it is irrelevant for the estimate of µ.

As mentioned above, all our formulas hold for known µ and Gaussian log returns. For the general case we have the following result: with i.i.d. log returns, not necessarily Gaussian, for the estimate

$$S^2 = \sum_{i=1}^{n}(r^*_i - \bar r^*)^2/(n-1)$$

we get

$$\mathrm{Var}(S^2) = \frac{\mu_{c4}}{n} - \frac{\sigma^4(n-3)}{n(n-1)}$$

which, for not too small n and a fourth centered moment not very different from the Gaussian case, gives us the same result as the above formula. Notice that in all these cases the sampling variance of the estimate of the variance (as that of the estimate of the expected value) goes to 0 with n going to infinity.

Let us conclude with the case of the smoothed estimate. We are going to use the approximation for the denominator given by: $\sum_{i=0,\dots,n}\lambda^i \approx \sum_{i=0,\dots,\infty}\lambda^i = 1/(1-\lambda)$.
The variance of the smoothed estimate is

$$V(V_t) = V\left((1-\lambda)\sum_{i=0,\dots,n}\lambda^i r^{*2}_{t-i}\right) = (1-\lambda)^2\sum_{i=0,\dots,n}\lambda^{2i} V\left(r^{*2}_{t-i}\right) = \frac{(1-\lambda)^2}{1-\lambda^2}\left(\mu_4-\mu_2^2\right) = \frac{1-\lambda}{1+\lambda}\,2\sigma^4$$

where the last equality is true if the expected value is zero (as assumed in RiskMetrics) and log returns are Gaussian (and recall: 1 − λ² = (1 + λ)(1 − λ)). Here it is meaningless to compare this with the quality of the estimate for µ because this is assumed equal to zero. It is, however, interesting to compare the result with a result based on sub periods of length 1/k. Everything depends on the choice of λ for the sub periods. If we set it to λ^{1/k} we have

$$V_{kt} = (1-\lambda^{1/k})\sum_{i=0,\dots,kn}\lambda^{i/k} r^{*2}_{kt-i}$$

so that, following the same steps as for V(V_t),

$$V(V_{kt}) = \frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\,2\,\frac{\sigma^4}{k^2}$$

and we have, for the estimate V't = kV_{kt} (we use the prime sign because this is different with respect to the estimate using aggregated data),

$$V(kV_{kt}) = k^2\,\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\,2\,\frac{\sigma^4}{k^2} = \frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\,2\sigma^4$$

This, for 0 < λ < 1 and k > 1, is always smaller than the variance computed using only full period data (k = 1)^20.

20 Notice that, with the smoothed estimate, the sampling variance of the estimate does not go to 0 for n going to infinity. On the other hand, the bigger k the smaller the sampling variance. Remember, however, as said above, that k cannot be taken as big as you wish, as the log random walk hypothesis becomes untenable for very short time intervals between observations.

Examples
Exercise 2 - volatility.xls
Exercise 3 - risk premium.xls
Exercise 3a - exp smoothing.xls
Exercise 3b - historical and implied volatility.xls
Exercise 3c - volatility.xls

4 Non Gaussian returns

It can be argued that a reasonable decision maker should be interested in the probability distribution of returns implied by the strategy the decision maker chooses. This should be true even if, in common academic analyses of decision under uncertainty, the use of polynomial utility functions tends to overweight the role of the moments of the return distribution, in particular of the expected value and variance^21. In some cases, as for instance the Gaussian case, the simple knowledge of expected value and variance is equivalent to the knowledge of the full probability distribution.

21 Due to the linearity of the expected value, the expected value of a polynomial utility function (that is, a linear combination of powers of the relevant variable) is a weighted sum of moments: $E\left(\sum_i \alpha_i X^i\right) = \sum_i \alpha_i E(X^i)$.
In this case the expected value of any utility function shall depend only on the expected value and the variance of the distribution (these being the only parameters of the distribution). Another way to say the same is that, in this case, if we are interested in the probability with which a random variable X can show values less than or equal to a given value k, it is enough to possess the tables of the standard Gaussian cumulative distribution function and compute:

$$\Phi\left(\frac{k-\mu}{\sigma}\right)$$

Notice that, for distributions characterized by more than two parameters, as for instance a non standardized version of the T distribution, this property is obviously no longer valid.

It is then of real interest to find good distribution models for stock returns and, in particular, to evaluate whether the simplest and most tractable model, the Gaussian distribution, can do the job. A better understanding of the problem can be achieved if we consider that, in most applications, we are not interested in the overall fit of the Gaussian distribution to observed returns but only in the quality of fit at the hot spots of the distribution, mainly the tails. In Finance the biggest losses are usually connected to extreme, negative, observations (for an unhedged institutional investor).
We shall see that the Gaussian distribution, while being, overall, not such a bad approximation of the underlying return distribution, is not so for the extreme, say 1-2%, tails^22. When studying stock returns we observe extreme events, mainly negative, in the order of µ minus 5σ and more, with a frequency which is incompatible with the probability of such or more negative events under the hypothesis of Gaussianity (in these evaluations µ and σ are estimated using a long record of data). While quite rare (do not be fooled by the fact that extreme events always make the news and so become memorable), such extreme events are much more frequent than would be compatible with a Gaussian calibrated on the expected value and variance of observed data. For instance, the probability of a µ − 5σ or more negative observation in a Gaussian is less than 0.0000003.

Let us consider an example based on a long series of I.B.M. daily returns. Between Jan 2nd 1962 and Dec 29th 2005 the IBM daily return shows a standard deviation of 0.0164^23. In this time period the return was below −5σ 14 times (this supposes a µ of 0: using the historical mean the number is even bigger). The number of observations is 11013, so the observed frequency of a −5σ event is 0.00127, that is: more than 4400 times the probability of such observations for a Gaussian with the same standard deviation! This is true for a very "mature" and "conservative" stock the like of I.B.M. Obviously, a frequency of 0.00127 is very small, but the events on which it is computed (big crashes) are those which are remembered in the history of the market. It is quite clear that, in this case, a Gaussian distribution hypothesis could imply a gross underestimation of the probability of such events.

The behaviour of the empirical distribution of returns can be summarized in the motto: fat tails, thin shoulders, tall head. In other words, given a set of (typically daily) returns over a long enough time period (we need to estimate tails and this requires lots of data), we can plot the histogram of our data over the density of a Gaussian distribution with the same mean and standard deviation. What we observe is that, while overall the Gaussian interpolation of the histogram is good, if we zoom on the extreme tails (first and last two percent of the data) we see that the tails of the histogram decrease at a slower rate than those of the Gaussian distribution. Moreover, toward the center of the distribution, we see how the "shoulders" of the histogram are thinner than those of the Gaussian and, correspondingly, the histogram is more peaked around the mean.

The following plots are from the excel file "Exercise 4 - Non Gaussian returns"; in this worksheet we use data from May 19th 1995 to Sep 28th 2005 on the same I.B.M. series as before. The first plot compares the interpolated histogram of empirical data with a Gaussian density with the same mean and variance as the data. You can clearly see the mentioned "fat tails, thin shoulders". Since tails, fat or not, are tails, that is: they are thin, in the second plot we focus on the extreme left tail, and at this scale the difference between the empirical and the Gaussian distribution becomes evident.

22 The Gaussian distribution can be a good approximation of many different unimodal distributions if we are interested (as is true in many applications of Statistics) in the behaviour of a random variable near its median. For modeling extreme events, having to do with system failures, breakdowns, crises and similar phenomena, a totally different kind of distribution may be required.

23 Data are in the excel file "Exercise 2 - IBM random walk".
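The numbers in the I.B.M. example are easy to reproduce; a quick sketch (the counts are those quoted in the text, and `norm_cdf` is built on the complementary error function to keep tail precision):

```python
from math import erfc, sqrt

def norm_cdf(x):
    """Standard Gaussian CDF; erfc is used for precision in the far tail."""
    return 0.5 * erfc(-x / sqrt(2.0))

p_tail = norm_cdf(-5.0)   # P(Z <= -5), about 0.00000029
n_obs = 11013             # daily I.B.M. returns, Jan 1962 - Dec 2005 (from the text)
n_hits = 14               # returns below -5 sigma reported in the text
freq = n_hits / n_obs     # about 0.00127
ratio = freq / p_tail     # how many times more frequent than the Gaussian predicts
```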
The x axis is scaled in terms of standard deviation units (1 means 1 standard deviation) and we see that, moving to the left starting at, roughly, 2, the empirical tail is above the Gaussian tail: extreme observations are more frequent than what we would expect in a Gaussian distribution with the same mean and variance as the data.

[Figure: "Empirical VS Gaussian density. I.B.M. data" — relative frequencies against the standard Gaussian density]

[Figure: "Left tail empirical VS Gaussian CDF. I.B.M. data" — empirical CDF against the Gaussian CDF on the left tail]

Another way to compare the empirical distribution with a Gaussian model (or any model you may choose) is the Quantile-Quantile (QQ) plot. In the worksheet you find the standardized version of the plot. In order to build a standardized QQ plot from data you must first choose a comparison distribution, in our case the Gaussian. The second step is that of standardizing the data, using some estimate of the data expected value and variance. The standardized dataset is then sorted in increasing order and the observations in this dataset shall be the X coordinates in the plot. For each observation of the standardized returns dataset, compute the relative frequency with which smaller than or equal values were observed. Compute then, using some software version of the standard Gaussian CDF tables, the value of the standard Gaussian which leaves on its left exactly the same probability as the relative frequency left on its left by the X data; this shall be the corresponding Y coordinate in the plot.

[Figure: "Quantile Quantile Plot. I.B.M. data" — standardized sorted returns VS standard Gaussian equivalent observations]

In the end what you see is a curve of coordinates X,Y. If the curve is a bisecting straight line, your empirical CDF is approximated well by a Gaussian CDF.
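The recipe above can be sketched in a few lines (the function names are mine; one deliberate deviation: instead of the raw relative frequency of smaller-or-equal values, which would send the largest observation to an infinite Gaussian quantile, the common plotting position (i + 0.5)/n is used):

```python
import random
from statistics import NormalDist, mean, pstdev

def qq_points(data):
    """Standardized Gaussian QQ plot coordinates.

    X: the standardized data sorted in increasing order.
    Y: the standard Gaussian value leaving on its left the empirical
    probability level of each X point (plotting position (i + 0.5)/n)."""
    m, s = mean(data), pstdev(data)
    xs = sorted((d - m) / s for d in data)
    n = len(xs)
    std = NormalDist()
    ys = [std.inv_cdf((i + 0.5) / n) for i in range(n)]
    return xs, ys

# for truly Gaussian data the points lie close to the bisecting line
random.seed(1)
sample = [random.gauss(0.001, 0.02) for _ in range(1000)]
xs, ys = qq_points(sample)
```

Fed with actual return data, the same function would show the left-tail departure from the bisecting line discussed below.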
Departures from the bisecting line are hints of possible non Gaussianity. To facilitate the reading of the plot, a bisecting line is added to the picture. In a second, equivalent, version of the plot the X coordinate is the same but on the Y axis we plot the difference between the Y computed for the previous plot and the bisecting line. This is called a "detrended" QQ plot.

For the I.B.M. data we see how, on the left tail, observed data are above the diagonal, meaning a left tail heavier than the Gaussian. On the opposite side of the plot we see how the QQ plot lies below the bisecting line. Again: this means that we are observing data far from the mean, on the right side of it, with a higher frequency than compatible with the Gaussian hypothesis.

Since the data are standardized, the scale of the plot is in terms of number of standard deviations. We see that, on the left tail, we even observe data near and beyond −6 times the standard deviation. The tail from minus infinity to −6 times the standard deviation contains a probability of the order of one divided by one billion for the standard Gaussian distribution. We also observe 10 data points on the leftmost −5σ tail. Since our dataset is based on 10 years of data, roughly 2600 observations, if we read our data as the result of independent extractions from the same Gaussian, these observations, while possible, are by no means expected, as the probability of observing 10 times, in 2600 independent draws, something which has in each draw a probability of 0.00000028 to be observed is virtually 0^24.

We can also follow a different, strongly related, line of thought. We see that in this dataset made of about 2600 daily observations we observe an extreme negative return of around −8σ. This is the most negative return, hence the minimum observed value. Now let us ask the following question: what is the probability of observing so negative a minimum if data come from a Gaussian?
Suppose data are iid and distributed according to a (standardized) Gaussian. In this case the probability of observing data below the minus 8 sigma level is Φ(−8), and this for each of the 2600 observations. However the probability of observing AT LEAST one value less than or equal to this is 1 minus the probability of never observing such a value, that is, due to iid: 1 − (1 − Φ(−8))^{2600}. It is clear that 1 − Φ(−8) is almost 1, but (1 − Φ(−8))^{2600} could be much smaller. Is it small enough to make 1 − (1 − Φ(−8))^{2600} big enough, so that a minimum value of −8σ over 2600 iid observations from a standard Gaussian should not be termed "anomalous"? The computation is not so simple, as Φ(−8) is a VERY small number and the precision of the Excel routine for its computation cannot be guaranteed. However, using Excel we get (1 − Φ(−8))^{2600} = .999999999998268, so that, even if we take into account the 2600 observations, the probability of observing as minimum of the sample a −8σ data point is still not really different from 0. I checked the result using Matlab (whose numerical routines should be more precise than Excel's) getting a very similar result.

24 To understand this use the binomial distribution. Question: suppose the probability of observing a −5σ in each of 2600 independent "draws" is 0.00000028. What is the probability of observing 10 such events? The answer, computed with Excel, is:

$$\binom{2600}{10} 0.00000028^{10} (1-0.00000028)^{2590} = 0.0000...$$

Meaning that, at the precision level of Excel, we have a 0! While the exact number is not 0, this means that, at least in Excel, the actual rounding error could be quite bigger than the result. For all purposes the answer is 0. Question: in this section we evaluated the "un-likelihood" of −5σ results in two different ways: first with a ratio between frequency and Gaussian based probability, then using the binomial distribution and, again, the Gaussian based probability. What is the connection between these two, different, arguments?
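These tail computations can be reproduced without worrying about spreadsheet precision; a minimal sketch (the function names are mine; the t CDF with 3 degrees of freedom uses its known closed form, and log1p/expm1 avoid the rounding problems mentioned above):

```python
from math import atan, erfc, expm1, log1p, pi, sqrt

def norm_cdf(x):
    """Standard Gaussian CDF; erfc keeps precision deep in the tail."""
    return 0.5 * erfc(-x / sqrt(2.0))

def t3_cdf(t):
    """Closed-form CDF of the Student t with 3 degrees of freedom."""
    u = t / sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + atan(u)) / pi

n = 2600
p8 = norm_cdf(-8.0)                  # P(Z <= -8), about 6.2e-16
# P(sample minimum <= -8) = 1 - (1 - p8)^n, via log1p/expm1 so the tiny
# probability is not rounded away
p_min_gauss = -expm1(n * log1p(-p8))

# t with 3 df: -8 Gaussian "sigmas" is -8*sqrt(3) in t units,
# since the t(3) variance is nu/(nu - 2) = 3
p_tail_t3 = t3_cdf(-8.0 * sqrt(3.0))
p_never_t3 = (1.0 - p_tail_t3) ** n  # close to the 0.347 quoted in the text
```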
In order to get 1 − (1 − Φ(−8))^n in the range of .01 (still very unlikely) we would need n = 15,000,000,000,000. These are open market days and would correspond to roughly 59 billion years. This is a time period roughly 4 times the current estimate of the age of our universe. (Again: beware of roundings!). In any sense a −8σ value is quite unlikely if data come from a standard Gaussian.

It should be noticed, as a comparison, that for a T distribution with, say, 3 degrees of freedom the probability of never observing a return of −8σ over 2600 days is only 0.347165227, so that the observed minimum (or a still smaller value) has a probability of 0.652834773, that is: by no means unlikely (in doing this computation recall that the Student's T distribution variance is ν/(ν − 2), where ν is the number of degrees of freedom, so that the quantile corresponding to −8 in a standard Gaussian is, now, $-8\sqrt{\nu/(\nu-2)}$). As you can see, while at first sight similar to the Gaussian, the T distribution is VERY different with respect to the Gaussian when tail behaviour is what interests us.

In the following section we shall consider the relevance of these empirical facts from the point of view of VaR estimation.

Examples
Exercise 4 - Non normal returns.xls
Exercise 4b - Non normal returns.xls

5 Four different ways for computing the VaR

First it is necessary to define the VaR^25. Suppose a sum W is invested in a portfolio at time t0 and we are interested in the p&l (profit and loss) between t0 and t1, that is: W_{t1} − W_{t0}. In all the (for us) relevant cases, that is when some "risk" is involved, this p&l shall be stochastic, due to the fact that W_{t1}, as seen at t0, is stochastic. Our purpose is to give a simple summary of such stochastic behaviour of the p&l, aimed at quantifying our "risk" in a possibly immediate way. Many such measures can be (and have been) suggested.

25 For this section I refer to the worksheets: exercise 4, 4b and 5.
The RiskMetrics procedure chose as its basis the so called VaR, "Value at Risk". Given a level α (usually very small: 1% to 5% as a rule) of probability, the VaR is defined as the α-quantile of the distribution of Wt1 − Wt0.

The definition of an α-quantile xα for the distribution of a random variable X is easy to write down and understand when the distribution of X, FX(x), is continuous and strictly increasing at least in an interval [xl, xu] such that FX(xl) < α < FX(xu). In this case we simply have

xα ≡ x : P(X ≤ x) = FX(x) = α, that is xα = FX^−1(α)

where the inverse of FX(x), that is FX^−1(α), is defined in a unique way, continuous and strictly increasing at least for FX(xl) < α < FX(xu). In this case the α-quantile is nothing but the value of X which corresponds to a cumulated probability exactly equal to α, and we indicate such value with xα.

In the case of a cumulative distribution function with jumps (corresponding to probability masses concentrated in specific values of x) there may be no x such that FX(x) = α for a given α. In this case the convention we use here is that of setting xα equal to the maximum of the values x of X which have positive probability and are such that FX(x) ≤ α.

Barring this possibility, it is correct to say that the VaR at level α, for a time horizon between t0 and t1, of your investment is that profit and loss such that the probability of observing a worse one is equal to α.

This definition seems to imply that we are required to directly compute a quantile of the p&l. This is not the case: what is required is a quantile of the return distribution. Indeed we have

Wt1 − Wt0 = Wt0 r*_{t0,t1}

Wt1 − Wt0 = Wt0 (e^{r_{t0,t1}} − 1)

where r*_{t0,t1} and r_{t0,t1} are, respectively, the linear and the log return in the time interval from t0 to t1.
Since the functions return -> p&l are both continuous and strictly increasing, the problem of finding the required quantile of the p&l is equivalent to the problem of finding the same quantile in the distribution of returns and transforming it back to the p&l. In this section we shall consider four different estimates of the VaR, which rely on different sets of hypotheses. Each estimate shall be presented in a very simple form; the reader is warned that the actual implementation of any of these estimates is more involved.

5.1 Gaussian VaR

5.1.1 Point estimate of the Gaussian VaR

A word of notice: in what follows we shall use R as the symbol for the random variable "return" and r for a possible value of such random variable. To avoid heavy notation we shall not indicate the kind of return we are speaking about, or the time interval considered: both pieces of information shall be clear from the context.

Gaussian VaR is the most restrictive setting used in practice. We suppose that R is distributed according to a Gaussian density with expected value and variance (µ, σ²) which are either known or estimated in such a way as to minimize sampling error problems. A typical attitude is that of setting µ = 0 and estimating σ², for instance, using the smoothed estimate described above.^26

The important point to remember about the Gaussian density is that, under this hypothesis, knowledge of mean and variance is equivalent to knowledge of any quantile. Under the Gaussian hypothesis the CDF is continuous, so we can find a quantile with exactly α probability on its left for any α. The procedure is simple: we must find rα such that:

P(R ≤ rα) = α

Proceeding with the usual argument, already well known from confidence interval theory, we get:

P(R ≤ rα) = α = P((R − µ)/σ ≤ (rα − µ)/σ) = Φ((rα − µ)/σ) = Φ(zα)

where zα is the usual α quantile of the standard Gaussian CDF Φ(.). We have, then:

(rα − µ)/σ = zα

rα = µ + σzα

This is quite easy.
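A minimal sketch of the point estimate, using the standard library's `NormalDist` to obtain zα; the σ value is the daily estimate quoted later in the text, while the choice α = 5% here is just an example:

```python
from statistics import NormalDist

alpha, mu, sigma = 0.05, 0.0, 0.0164      # mu = 0, as in the RiskMetrics attitude
z_alpha = NormalDist().inv_cdf(alpha)     # ~ -1.645
r_alpha = mu + sigma * z_alpha
print(r_alpha)                            # ~ -0.027: the alpha-quantile of the return
```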
The problem is that, for small values of α, we are considering quantiles very far out on the left tail, and our previous empirical analysis has shown how the Gaussian hypothesis for returns (overall not so bad) is inadequate for extreme tails. Typically, the problem of fat tails shall imply an undervaluation of the VaR.

26 This hypothesis is not reasonable for linear returns, which are bounded below. It is however sometimes used in this case too.

5.1.2 Approximate confidence interval for the VaR

Now a problem: we know neither µ nor σ, so we must estimate them. The usual RiskMetrics procedure sets µ = 0 and estimates σ with the smoothed estimate introduced above. The estimate of the quantile of the return distribution, that is rα = σzα, shall then be r̂α = σ̂zα. According to sound statistical practice we should complement this with a measure of sampling variability. Here we show a possible, approximate and simple, way to do so by computing a lower confidence bound.

Under the assumptions of uncorrelated observations with constant variance and zero expected value, it is easy to compute the variance of r̂α² = σ̂²zα². In fact we have

V(r̂α²) = zα⁴ V(σ̂²)

During the discussion about the different precision in estimating E(r) = µ and V(r) = σ² we derived, for the Gaussian zero-µ case, the formula

V(σ̂²) = 2σ⁴ (Σ_{i=0,...,n} λ^{2i}) / (Σ_{i=0,...,n} λ^i)²

Using the approximation

(Σ_{i=0,...,n} λ^{2i}) / (Σ_{i=0,...,n} λ^i)² ≈ (1 − λ)/(1 + λ)

this becomes

V(σ̂²) = 2σ⁴ (1 − λ)/(1 + λ)

so that

V(r̂α²) = zα⁴ 2σ⁴ (1 − λ)/(1 + λ)

We can then estimate the σ⁴ term by taking the square of the estimate of the variance, and get

V̂(r̂α²) = zα⁴ 2σ̂⁴ (1 − λ)/(1 + λ)

A possible approximate lower confidence bound for the quantile estimate, with the usual "two sigma" rule, is given by (minus) the square root of a two sigma one sided interval for r̂α².
The quantity

r̂α² + 2√(V̂(r̂α²)) = r̂α² + 2σ̂²zα² √(2(1 − λ)/(1 + λ)) = r̂α² [1 + 2√(2(1 − λ)/(1 + λ))]

is an (upper) confidence bound for the square of the quantile estimate. In order to convert it into a (lower) bound for the quantile estimate, we simply take

r̂α √(1 + 2√(2(1 − λ)/(1 + λ)))

Now let us see some numbers. Suppose an estimate of the (daily) σ as the one obtained above from I.B.M. data between 1962 and 2005: 0.0164. You are computing a daily Gaussian VaR with α = 0.025; in this case z0.025 = −1.96. This gives a quantile point estimate equal to

r̂α = σ̂zα = 0.0164 · (−1.96) = −0.0321

Let us assume that our variance estimate comes from a typical implementation of the smoothed estimate formula with daily data, n = 256 (meaning roughly one year of data) and λ = 0.95. In this case we have √(1 + 2√(2(1 − λ)/(1 + λ))) = 1.2053, and the bound shall be roughly 20% more negative than the point estimate of the quantile, that is

r̂α √(1 + 2√(2(1 − λ)/(1 + λ))) = −0.0321 · 1.2053 = −.0387

Notice that √(1 + 2√(2(1 − λ)/(1 + λ))) = 1.2053 only depends on the choice of λ, so that it can be precomputed for any estimate sharing the same choice of λ.

What we found is a confidence bound for the quantile of the (log) return. In order to transform this into a confidence bound for the VaR we need to know the amount W invested at time t0. The bound to the VaR shall be W · (e^{−.0387} − 1) = W · (−.0380), that is, a loss of 3.8% (and here the use of % is correct, why?).

In order to further understand the consequences of using the smoothed estimate, consider the case of the "classic" estimate, with λ = 1, in

V(σ̂²) = 2σ⁴ (Σ_{i=0,...,n} λ^{2i}) / (Σ_{i=0,...,n} λ^i)²

so that

V(σ̂²) = 2σ⁴/(n + 1)

and the bound shall be

r̂α √(1 + 2√(2/(n + 1)))

and, due to n, we quickly have that the extreme of the interval becomes almost identical to the point estimate. For n = 256 we already get r̂α √(1 + 2√2/√257) = r̂α · 1.084 = −0.0348.
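The numbers in this subsection can be reproduced directly; a sketch using the text's values σ̂ = 0.0164, zα = −1.96, λ = 0.95:

```python
import math

sigma_hat, z_alpha, lam = 0.0164, -1.96, 0.95

r_hat = sigma_hat * z_alpha                                        # ~ -0.0321
factor = math.sqrt(1 + 2 * math.sqrt(2 * (1 - lam) / (1 + lam)))   # ~ 1.2053
print(r_hat * factor)                                              # bound: ~ -0.0387

# p&l bound per unit of invested wealth W (log-return to p&l conversion)
print(math.exp(r_hat * factor) - 1)                                # ~ -0.0380: a 3.8% loss

# "classic" estimate (lambda = 1) with n = 256: bound almost equals the estimate
n = 256
print(r_hat * math.sqrt(1 + 2 * math.sqrt(2 / (n + 1))))           # ~ -0.0348
```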
5.1.3 An exact interval under stronger hypotheses (not for the exam)

For those interested in "exact" confidence intervals, we can derive a more formally strong result using the following theorem:

Theorem 5.1. If {X1, X2, ..., Xn} are iid Gaussian random variables with expected value µ and standard deviation σ, then

S² = Σ_{i=1,...,n} ((Xi − µ)/σ)²

is distributed according to a Chi square distribution with n degrees of freedom.

This implies that, if we estimate the variance with the non smoothed sample variance (with µ = 0):

V̂(r) = Σ_{i=1,...,n} ri²/n

we have that

nV̂(r)/σ² = Σ_{i=1,...,n} ri²/σ² ~ χ²n

so that

P(nV̂(r)/σ² ≥ χ²_{n,1−β}) = P(σ² ≤ nV̂(r)/χ²_{n,1−β}) = 1 − β

where χ²_{n,1−β} is the value that a χ²n random variable exceeds with probability 1 − β (that is, its β lower-tail quantile). Hence nV̂(r)/χ²_{n,1−β} is a 1 − β upper confidence bound for σ², and from this we see that a 1 − β confidence lower extreme for the α quantile is given by

Lr̂α = zα √(nV̂(r)/χ²_{n,1−β})

(recall that zα is negative). With the same numbers we used above, and using β = .025, so that χ²_{n,1−β} = χ²_{256,.975} = 213.5747, and α = .025, this becomes

Lr̂α = −1.96 √(256 · 0.0164²/213.5747) = −.0352

against a point estimate of .0321 (here we drop the minus sign). Since this lower bound estimate is based on the non smoothed estimate of the variance, it can be compared with the corresponding approximate bound found in the previous section. That bound was −0.0348: very similar to the one derived here in a rather different way.

5.2 Non parametric VaR

5.2.1 Point estimate

The non parametric VaR estimate stands, in some sense, at the opposite of the Gaussian VaR. In the non parametric case we suppose only that returns are i.i.d., but we avoid assuming anything about the underlying distribution. However, in order to find the VaR we need an estimate of the unknown theoretical distribution. In standard parametric settings, where we assume, e.g., normality, this is done by estimating parameters and then computing the required probabilities and finding the required quantiles using the parametric model with estimated parameters.
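The exact bound can also be reproduced; since the Python standard library has no χ² quantile function, the value 213.5747 quoted in the text is taken as given rather than recomputed:

```python
import math

n, sigma_hat, z_alpha = 256, 0.0164, -1.96
chi2_lower = 213.5747            # 2.5% lower quantile of the chi-square with 256 df (from the text)

v_hat = sigma_hat ** 2           # non smoothed variance estimate (mu = 0)
lower_bound = z_alpha * math.sqrt(n * v_hat / chi2_lower)
print(lower_bound)               # ~ -0.0352
```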
Since we now are making no specific assumption about the return distribution, we need to find an estimate of it which is "good" whatever the unknown distribution be. The starting point of all non parametric procedures is to estimate the theoretical distribution using the empirical distribution function. Suppose we have a sample of n i.i.d. returns with common distribution F(.) which yields the observed values {r1, r2, ..., rn}; then our estimate of F(.) shall be:

P̂(R ≤ r) = F̂R(r) = #{ri ≤ r}/n = Σ_{i=1,...,n} I(ri ≤ r)/n

where #{ri ≤ r} means "the number of observed returns less than or equal to r" and I(ri ≤ r) is a function which is equal to 1 if ri ≤ r and 0 otherwise.

Under our hypothesis of i.i.d. returns with unknown distribution F(.), the above defined estimate works quite well, in the sense that

E(Σ_{i=1,...,n} I(ri ≤ r)/n) = (n/n) E(I(ri ≤ r)) = P(ri ≤ r) = F(r)

and

V(Σ_{i=1,...,n} I(ri ≤ r)/n) = (n/n²) V(I(ri ≤ r)) = F(r)(1 − F(r))/n

where the last passage depends on the fact that, for given r, I(ri ≤ r) is a Bernoulli random variable with P = F(r).

Given this estimate of F, the non parametric VaR is, in principle, very easy to compute. Order the observed ri in an increasing way, then define r̂α as the smallest ri such that the observed frequency of data less than or equal to it is α, if such an ri exists. This ri does not exist if α is not one of the observed values of the cumulative frequencies, that is: if there exists no ri such that F̂R(ri) = α. In this case we make an exception with respect to the common definition of empirical quantile and define r̂α as the biggest observed ri such that F̂R(ri) < α. (Linear or other interpolations between consecutive observations are frequently used, but we shall consider this in the "semi parametric VaR" section). This is nothing but a possible definition for the inversion of the empirical CDF.^27
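A sketch of the empirical CDF and of the conservative quantile convention just described; the return sample is invented for illustration:

```python
def ecdf(returns, r):
    # F_hat(r): fraction of observed returns less than or equal to r
    return sum(1 for ri in returns if ri <= r) / len(returns)

def empirical_quantile(returns, alpha):
    # Smallest r_i with F_hat(r_i) = alpha if it exists; otherwise the biggest
    # observed r_i with F_hat(r_i) < alpha (the text's conservative convention)
    ordered = sorted(returns)
    candidates = [ri for ri in ordered if ecdf(returns, ri) <= alpha]
    return candidates[-1] if candidates else ordered[0]

sample = [-0.051, -0.032, -0.021, -0.012, -0.004, 0.002, 0.008, 0.015, 0.024, 0.041]
print(empirical_quantile(sample, 0.2))    # -0.032, since F_hat(-0.032) = 0.2 exactly
print(empirical_quantile(sample, 0.25))   # -0.032 again: biggest r_i with F_hat < 0.25
```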
The problem with this estimate is that, if α is small, we are considering areas of the tail where, probably, we made very few observations. In this case the estimate could be quite unstable and unreliable. The reader should compare this estimate with the estimate of a quantile in the Gaussian case. In the Gaussian case we estimate quantiles by inverting the CDF which, in its turn, is estimated indirectly, by estimating µ and σ, the unknown parameters. This implies that any data point tells us something about any point of the distribution (maybe very far from the observed point), as it contributes to the estimate of both parameters. In other terms, a parametric hypothesis allows us to estimate the shape of the distribution in regions where we do not make any observations. Instead, in the non parametric case, each data point, in some sense, has only a "local" story to tell. To be more precise: the non parametric estimate of the CDF at a given point r does not change if we change in any way the values of our data, provided we keep constant the number of observations smaller than and greater than r. So, we use very little information from the data in a non parametric estimate, while the influence of any data point on a parametric estimate is big. An unwritten law of Statistics is that, if you use little information, you are going to get an estimate which is robust to many possible hypotheses on the data distribution but with a high sampling variability; on the other hand, if you use a lot of information in your data, as you do in a parametric model, you are going to have an estimate which is not robust but with a smaller sampling variability. This is what happens in the case of non parametric VaR when compared to, say, Gaussian VaR.

27 Our choice does not correspond to some definitions of empirical quantile you may find in Statistics books. In particular, in the case where no ri exists such that F̂R(ri) = α, the empirical quantile is sometimes defined as the smallest observed ri such that F̂R(ri) > α. This would be not proper for our purpose, which is to estimate the size of a possible loss and, if needed, exaggerate it on the safe side.

5.2.2 Confidence interval for the non parametric quantile estimate

Let us study these properties a little by computing a one sided confidence interval for the α-quantile rα on the basis of a simple random sample of size n. Suppose we order our data from the smallest to the biggest value (that is: we compute the "order statistics" of the sample). Call r(j) the j-th order statistic. Using the "integer part" notation, where [c] is the largest integer smaller than or equal to c, the above defined estimate of rα can be simply written as r̂α = r([nα]). This means that our estimate is the nα-th ordered observation if nα is an integer, or the ordered observation corresponding to the largest integer smaller than nα. This is a sensible choice but, as usual, due to sampling variability, the estimate could be either more or less negative than the "true" (and unobservable) rα. We would be quite worried if rα ≤ r([nα]) (the "=" is here for cautiousness' sake). A possible strategy in order to lower the probability of this event is that of building a lower bound for the estimate based on a r(j) with j < [nα]. In order to choose this j we must answer the following question: what is the probability that the "true" α quantile rα is smaller than or equal to any given j-th order statistic (ordered observation) r(j) in the sample? If this event happens and we used this order statistic as estimate, the estimate shall be wrong, in the sense that we shall undervalue the possible loss (as noted above, the "equal" part is put in as an added guarantee). This error is going to happen if, by chance, we make at most j observations smaller than, or equal to, rα.
In fact, when, for instance, the number of observations less than rα is j − 1, the j-th empirical quantile shall be bigger (less negative) than rα (or equal to it, see the previous sentence). Since observations are iid and, supposing a continuous underlying distribution, the probability of observing a return less than or equal to rα is, by definition, α, the probability of making exactly i observations less than (or equal to) rα (and so n − i bigger than rα) is

C(n, i) α^i (1 − α)^{n−i}

We then have that the probability of making at most j observations less than (or equal to) rα, that is, the probability that r(j) be greater than or equal to rα, is equal to the sum of the probabilities of observing exactly i returns smaller than or equal to rα for i = 0, 1, 2, ..., j. For i = 0 all observations are greater than rα; for i = 1 only the smallest observation is smaller than or equal to rα, and so on up to i = j, where we have exactly j observations smaller than or equal to rα (we include the case r(j) = rα because we want to be on the "safe side" and avoid a possible undervaluation of the risk). Obviously, from i = j + 1 onward we have j + 1 or more observations smaller than or equal to rα, so that r(j) shall be, supposing the probability of "ties" (identical observations) equal to 0 as in the case of a continuous F, strictly smaller than rα. In the end, the probability of "making a mistake" in the sense of undervaluing the possible loss, that is, the probability of choosing an empirical quantile r(j) greater than or equal to rα, is given by:

P(r(j) ≥ rα) = Σ_{i=0,...,j} C(n, i) α^i (1 − α)^{n−i}

Now the confidence limit: to be conservative, we want to estimate rα with an empirical quantile r(j) such that we have a small probability β that the true quantile rα is smaller than its estimate. This, again, is because we are willing to overstate and not to understate risk; hence, we "prefer" to choose an estimate more negative than rα rather than a less negative one.
Obviously, we would also like not to exaggerate on the safe side. Our strategy shall be as follows: we choose a r(j) such that P(r(j) ≥ rα) ≤ β, for a given β which represents, with its size, how much we are willing to accept an underestimation of the risk (the smaller the β, the more averse we are to underestimating rα). On the other hand, we do not want j to be smaller (that is, r(j) more negative) than required. Summarizing: we must solve the problem

max j : P(r(j) ≥ rα) ≤ β

This is going to be the extreme of our one tail confidence interval. Notice this: the expected value of the random variable i with probability function C(n, i) α^i (1 − α)^{n−i} is equal to nα, so that we could "expect" the empirical quantile corresponding to the index j just smaller than or equal to nα to be the "right choice" (and exactly this choice of point estimate was made in the previous paragraph). E.g., if α = .01 and n = 2000, intuitively we could use as an estimate of rα the empirical quantile r(20). However, if we make this choice, we are going (for n and α not too small) to have roughly fifty-fifty probability that the true quantile is on the left or on the right of the estimate. This is due to the central limit theorem, according to which

Σ_{i=0,...,j} C(n, i) α^i (1 − α)^{n−i} ≈ Φ_{nα; nα(1−α)}(j)

If the approximation works for our n and α, we see that nα is the mean of an (almost) Gaussian, hence the probability on the right and on the left of it is .5. For reasons of prudence, fifty-fifty is not good for us: we go for a smaller probability that the chosen quantile be bigger than rα, that is, for a β smaller than .5. For this reason we choose an empirical quantile corresponding to a j smaller than the j just smaller than (or equal to) nα, and we do this according to the above rule.
The just quoted central limit theorem, if n is big and α not too small, simplifies our computations with the following approximation:

P(r(j) ≥ rα) = Σ_{i=0,...,j} C(n, i) α^i (1 − α)^{n−i} ≈ Φ_{nα; nα(1−α)}(j) = Φ_{0;1}((j − nα)/√(nα(1 − α)))

With this approximation, we want to solve

max j : Φ_{0;1}((j − nα)/√(nα(1 − α))) ≤ β

so that our solution is given by the biggest (integer) j such that (j − nα)/√(nα(1 − α)) ≤ zβ or, what is the same, the biggest (integer) j such that j ≤ nα + √(nα(1 − α)) zβ. Using the more compact "integer part" notation and calling r̂α,β our lower bound, we have:

r̂α,β = r([nα + √(nα(1 − α)) zβ])

Notice that [nα + √(nα(1 − α)) zβ] does not depend on the observed data, but on α, β, n only. Hence the solution, in terms of j, that is: which ordered observation to use (obviously NOT in terms of r(j)), is known before sampling.

Suppose, for instance, you have 1000 observations and look for the 2% VaR. The most obvious empirical estimate of the 2% quantile is the 20th ordered observation but, according to the central limit theorem, the probability that the true 2% quantile is on its left (as on its right, but this is not important for us) is 50%. To be conservative you wish for a quantile which has only 2.5% probability of being on the right of the 2% quantile. Hence you choose a β of 2.5% (zβ = −1.96) and you get

nα + zβ √(nα(1 − α)) = 1000 · .02 − 1.96 · √(1000 · .02 · .98) = 11.32

According to this result, your choice for the lower (97.5%) confidence bound for the (2%) VaR is given by the [11.32] = 11th ordered observation, that is, roughly, the 1% empirical quantile. Beware: do not mistake α for β. The first defines the quantile you want to estimate (rα), the second the confidence level of the confidence interval.

Is this prudential estimate much different w.r.t. the simple "expected" quantile? It depends on the distance between ranked observations on the tail, for observed cumulative frequencies of value about α.
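The rule for choosing the order statistic is a one-liner; a sketch with the text's numbers (n = 1000, α = 2%, β = 2.5%):

```python
import math

def conservative_index(n, alpha, z_beta):
    # j = [n*alpha + z_beta * sqrt(n*alpha*(1 - alpha))], with z_beta < 0
    return math.floor(n * alpha + z_beta * math.sqrt(n * alpha * (1 - alpha)))

print(conservative_index(1000, 0.02, -1.96))   # 11: use the 11th ordered observation
```

As the text notes, the index depends only on n, α and β, so it can be computed before seeing any data.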
If the tail goes down quickly, the distance is small and the difference between, in this case, the 11th and the 20th ordered observation shall not be big. On the contrary, with heavy tails, the difference between the 1% and the 2% empirical quantile can be quite big.

As an example, consider the case of the I.B.M. data between May 19th 1995 and Sep 28th 2005 discussed above. The point estimates are: the 2.5% empirical quantile (66th ranked observation) is −4.105% and the 1% empirical quantile (26th ranked observation) is −5.57%. These point estimates correspond to 97.5% confidence bounds of −4.67% (observation 50) and −6.53% (observation 16): in the first case roughly .5% more negative than the point estimate, in the second case 1%. The reason for the difference is that around the 1% empirical quantile observations are more "rarefied", hence with larger intervals in between, than around the 2.5% empirical quantile.

With the same data and a Gaussian VaR estimate, using, for comparison, the full sample standard deviation as estimate of σ (value 0.021421), we get, for the 2.5% VaR, a point estimate of −4.14%, to be compared with the 2.5% empirical quantile of −4.105% (confidence bound −4.67%). However, in the Gaussian case the (approximate 2σ) lower confidence limit, given the more than 2600 observations and the unsmoothed estimate, is −4.23%: very similar to the point estimate. As we did see a moment ago, this is not true for the empirical VaR (.5% of difference between the estimate and the confidence limit). Things are worse on more extreme quantiles. If we compute the 1% quantile in the Gaussian case we get −4.93%, with a (two σ) bound of −5.02%, to be compared with the non parametric −5.57% and the corresponding bound of −6.53%.^28

When we are evaluating extreme quantiles, two "negative" forces sum. First, the empirical distribution is very "granular" in the tails (very few observations).
Second, the empirically observed heavy tails imply the possibility of a considerable difference between contiguous quantiles, bigger than expected in the case of Gaussian data.

Non parametric VaR, sometimes dubbed "historical" VaR because it uses the observed history of returns in order to estimate the empirical CDF, is probably the most frequently used in practice. Again, confidence limits are often ignored, and this could be due to their dismally "big" size. The problem of a big sampling variance for such estimates is very well known. Applied VaR practitioners and academics have suggested, in the last years, an amazing quantity of possible strategies for improving the quality of the non parametric tail estimate. Most of these suggestions fall into two categories: semi parametric modeling of the distribution tails, and filtered resampling algorithms.

28 These estimates may change very much if we change the sample. For instance, with a longer stretch of data, between 1962 and 2005, the standard deviation is 0.0164, the 2% Gaussian VaR is −3.37% while the 2% empirical quantile is −3.4%.

In the following subsection we shall consider a simple example of a semi parametric model. The resampling approach is left for more advanced courses.

5.3 Semi parametric VaR

Semi parametric VaR mixes a non parametric estimate with a parametric model of the left tail. The aim is that of combining the robustness of the non parametric approach with the greater efficiency of a parametric approach. As we saw in the previous section, the non parametric approach can result in good evaluations of the VaR for values of α not very small; for small α its sampling variability may be non negligible even for big samples. However, in VaR computation we look for a quantile estimate exactly for small α. The idea of a semi parametric approach is that of using a parametric model just for the tail of the distribution, beyond a small but not too small α-quantile.
The plug-in point of the parametric model is estimated in a non parametric way; from that point onward, a parametric model is used in order to estimate the required extreme quantile. The reason why this may work is that, on the basis of arguments akin to the central limit theorem but considering not means but extreme order statistics, we can prove that, while we may have many different parametric models for probability distributions, the tails of such distributions behave in a way that can be approximated with few (typically three) different parametric models. Here is not the place to introduce the very interesting topic of "extreme value theory", a recent fad in the quantitative Finance milieu, so we shall not be able to fully justify our choice of tail parametric model. Be it sufficient to say that a rigorous justification is possible.

We suppose that, for r negative enough:

P(R ≤ r) = L(r)|r|^{−a}

where, for such negative enough r, L(.) is a slowly varying function for r → −∞ (formally, this means lim_{r→−∞} L(λr)/L(r) = 1 for all λ > 0; you can understand this as implying that the function L is approximately a constant for big negative values of r) and a is the speed with which the tail of the CDF goes to zero at a polynomial rate. This is sometimes called a "Pareto" tail, because a famous density showing this tail behaviour bears the name of Vilfredo Pareto. This choice of tail behaviour could be justified on the basis of limit theory, as hinted at before, or on the basis of good empirical fit to data. Notice that the Gaussian CDF has exponential tails, which go to 0 much faster than polynomial tails. Pareto tails are, thus, a model for "heavy" tails.

Provided we know where to plug in the model (that is: which value of r is negative enough), our first task is that of estimating a, the only parameter in the model.
In order to do so, we take the logarithm of the previous expression and get:

log(P(R ≤ r)) = log(L(r)) − a log(|r|)

We then assume that, maybe with an error, log(L(r)) can be approximated by a constant C:

log(P(R ≤ r)) ≈ C − a log(|r|)

This expression begins to look like a linear model. In fact, if, in correspondence of any observed ri, we estimate log(P(R ≤ ri)) with log(F̂R(ri)) and summarize the various approximations in an error term ui, we have:

log(F̂R(ri)) = C − a log(|ri|) + ui

A linear regression based on this model shall not work for the full dataset of returns, but it shall work for a properly chosen subset of extreme negative returns. A simple way to find the proper subset of observations is to plot log(F̂R(ri)) against log(|ri|) for the left tail of the distribution. Typically this plot shall show a parabolic region (consistent with the Gaussian hypothesis) followed by a linear region (consistent with the polynomial hypothesis). The regression shall be run with the data from the second region.

Suppose we now have an estimate for a: how do we get an estimate of the quantile? The problem is that of plugging the parametric tail into the non parametric estimate of the CDF. The solution is simple if we suppose we have a good non parametric estimate for the quantile rα1, where α1 is too big for this quantile estimate to be used as VaR. What we need is an estimate of rα2 for α2 < α1. If we suppose that the tail model is approximately true for both quantiles we have:

α1/α2 = (L(rα1)/L(rα2)) (rα2/rα1)^a

But the ratio L(rα1)/L(rα2) should be very near to 1 (the same slowly varying function computed at not very far away points), so that we can directly solve for rα2:

rα2 = rα1 (α1/α2)^{1/a}

Given the non parametrically estimated rα1, an estimate of a (based on the above described regression) and a chosen α2, we are then able to estimate the quantile rα2.
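A sketch of the two steps — the tail regression for a, and the quantile extrapolation — on a handful of invented tail observations (both the returns and their empirical CDF values below are hypothetical, chosen only to illustrate the mechanics):

```python
import math

# Hypothetical extreme-left-tail data: returns and their empirical CDF values
tail_r = [-0.035, -0.041, -0.048, -0.057, -0.069, -0.085]
tail_F = [0.025, 0.018, 0.012, 0.008, 0.005, 0.003]

# OLS slope of log(F_hat(r_i)) on log(|r_i|): the slope estimates -a
x = [math.log(abs(r)) for r in tail_r]
y = [math.log(F) for F in tail_F]
mx, my = sum(x) / len(x), sum(y) / len(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
a = -slope                                   # tail index estimate

# Extrapolate from the 2.5% quantile down to the 1% quantile
alpha1, alpha2 = 0.025, 0.01
r_a1 = tail_r[0]                             # non-parametric quantile at alpha1
r_a2 = r_a1 * (alpha1 / alpha2) ** (1 / a)
print(a, r_a2)                               # r_a2 is more negative than r_a1
```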
[Figure: log-log plot of the extreme negative (in absolute value) data. See how the linear hypothesis seems to work to the left of −3/−4 sigma.]

5.3.1 Confidence interval for the semi parametric estimate (not for the exam)

We are interested in a lower bound for a. In fact, we see from the above formula that the bigger is a (which is positive by definition), the faster shall be the decline of the log CDF and the nearer the quantile. So, our risk is to exaggerate a due to sampling error. If we suppose, with some excess of faith, that the model

log(F̂R(ri)) = C − a log(|ri|) + ui

satisfies the hypotheses required for a linear model (which we shall analyze in detail during this course), we have that the best estimate (the OLS estimate) of −a shall be

−â = Cov(log(F̂R(ri)); log(|ri|)) / Var(log(|ri|))

and the sampling variance of this shall be given by

Var(−â) = (Var(ui) / Var(log(|ri|))) (1/n)

Var(ui), under standard OLS hypotheses, can be estimated on the basis of the errors of fit of the linear model, given by ûi = log(F̂R(ri)) − Ĉ + â log(|ri|), using the formula

V̂ar(ui) = Σ_{i=1,...,n} ûi² / (n − 2)

The lower bound of a 1 − β one sided interval for a shall be given by

âL = â − t_{n−2}^{1−β} √((V̂ar(ui) / Var(log(|ri|))) (1/n))

This implies a lower bound for the estimated quantile given by

r^L_{α2} = rα1 (α1/α2)^{1/âL}

A detailed discussion of this method, with suggestions for the choice of the subset of data on which to estimate the tail index and formulas for more sophisticated confidence intervals, may be found in:^29 "Tail Index Estimation, Pareto Quantile Plots, and Regression Diagnostics", Jan Beirlant, Petra Vynckier, Jozef L. Teugels, Journal of the American Statistical Association, Vol. 91, No. 436 (Dec., 1996), pp. 1659-1667. Jstor link: http://www.jstor.org/stable/2291593

29 However, the reading of this paper is not required for the course.
A comparison of Gaussian, non parametric and semi parametric VaR is shown in detail in the Excel worksheet Exercise 5 - VaR.xls.

5.4 Mixture of Gaussians

An intuitive idea any observer of the market could hold is that trading days are not all the same. Most days go unnoticed while, for a small number of days, the market seems to work at a fast rate: time seems to pass faster and volatility is higher. This common notion is consistent with the fact that relevant information does not arrive to the market as a continuous, uniform flow, but rather in bits and chunks. A possible, very simple, formalization of this observation is as follows. There exist two types of day: 1 and 2. Conventionally, we shall label with 1 "standard" days and with 2 "fast" days. Suppose that in both day types the distribution of returns is Gaussian with the same expected value µ but with different variances: σ1², σ2². We do not know, either a priori or a posteriori, in which type of day we are going, or did, live. Both day types are compatible with any kind of return, even if, obviously, returns far in the tails are more likely in fast days. In other terms: we do not observe data from the two separate distributions of returns, but a mixture of data. The density of this mixture shall be given by the mean of the two Gaussian densities with weights P and 1 − P, where P is the probability of being in an ordinary day. If we indicate with N(r; µ, σ²) the Gaussian density with expected value µ and variance σ², we can compute the marginal density for each observation ri as:

f(ri; µ, σ1, σ2, P) = N(ri; µ, σ1²) P + N(ri; µ, σ2²)(1 − P)

(this is a simple application of the standard rule for computing the marginal distribution given the conditional distributions and the probabilities of the conditioning events). Suppose now we have an i.i.d. sample where the density of each observation is the mixture f. In order to estimate the unknown parameters, a sensible procedure is as follows. First step: estimate µ.
Here we exploit the fact that µ is the same in both day types. Hence the expected value of the mixture is, again, µ and we can estimate it using the simple sample average:

µ̂ = Σ_{i=1,...,n} r_i / n

We then plug this estimate into f and build the likelihood of our sample r:

ℓ_r(σ1, σ2, P) = Π_{i=1,...,n} f(r_i; µ̂, σ1, σ2, P)

We can then compute the maximum likelihood estimates of the unknown parameters σ1, σ2, P as the values which maximize ℓ (or, typically, its logarithm). Notice that this procedure does not allow analytic computation and requires numerical optimization methods. This simple mixture model, while rough, can be easily implemented and gives a very good approximation of the return distribution with the exception of the extreme tails. The tails of a Gaussian mixture are still exponential, while the extreme tails of the empirical CDF seem to decrease at a slower, maybe polynomial, rate. However the quality of fit is, in general, quite good down to α values even smaller than 1%, that is: values which allow VaR estimates. This is a very simple example of mixture models. Many variants are possible (and can be found in real world applications and in academic literature): we can increase the number of components, we can use different kinds of distributions as components, we can model a dependency of P on time and on other observable variables (a sensible choice are past squared returns).

[Figure: two Gaussian densities with the same expected value and different standard deviations.]
[Figure: the mixture pφ(x;μ,s1)+(1-p)φ(x;μ,s2) of two Gaussians compared with a single Gaussian φ(x;μ,s) with the same mean and variance as the mixture.]

Examples
Exercise 5 - VaR.xls
Exercise 5b - Gaussian Mixture Model.xls

Required Probability and Statistics concepts. Sections 6-12. A quick check.
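A minimal sketch of this two-step procedure in Python, assuming simulated data and illustrative parameter values (the variable names, the log/logistic reparametrization and the Nelder-Mead optimizer are choices made for the example, not part of the handouts):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated sample: same mean, two volatilities (all parameter values invented)
n, mu, s1, s2, p = 5000, 0.0, 0.01, 0.03, 0.8
fast = rng.random(n) >= p                        # "fast" day with probability 1 - p
r = rng.normal(mu, np.where(fast, s2, s1))

# First step: estimate the common mean with the sample average
mu_hat = r.mean()

# Second step: maximize the log likelihood numerically over (sigma1, sigma2, P),
# using an unconstrained parametrization to keep the sigmas positive and P in (0, 1)
def neg_loglik(theta):
    sig1, sig2 = np.exp(theta[0]), np.exp(theta[1])
    P = 1.0 / (1.0 + np.exp(-theta[2]))
    dens = P * norm.pdf(r, mu_hat, sig1) + (1 - P) * norm.pdf(r, mu_hat, sig2)
    return -np.log(dens).sum()

x0 = [np.log(r.std()), np.log(2 * r.std()), 0.0]
res = minimize(neg_loglik, x0, method="Nelder-Mead")
sig1_hat, sig2_hat = np.exp(res.x[0]), np.exp(res.x[1])
P_hat = 1.0 / (1.0 + np.exp(-res.x[2]))
```

Real implementations (and the EM algorithm often used for mixtures) are more careful about starting values and local maxima; this is only meant to show that the optimization is numerical, not analytic.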
The main difference between the first and the second part of the course is due to the fact that in the second part we are mainly interested in vectors of returns. This is the obvious point of view of any professional involved in asset pricing, asset management and risk management. To begin with, we need a compact and reasonably clear notation for dealing with vectors of returns, both from the mathematical and the probabilistic/statistical point of view. For this reason the second part of these handouts opens with two chapters, 6 and 7, dedicated to a quick introduction to the basic notation. You can find something more in the appendixes of these handouts: 18. Most of the second part of these handouts is centered on the study of the general linear model (section 9) and of its applications in Finance. I expect this topic to be new for most of the class, hence the handouts contain a rather detailed and complete, if simple, introduction to it. Another important tool introduced in the following chapters is principal component analysis (section 11), in the context of linear asset pricing models. Most of what follows is self contained; however, some basic concepts and results of Probability and Statistics are required. You can find these in appendix 19. Among the most important of these concepts and results, to be added to those already summarized in the first part of these handouts, I would point out: conditional expectation and regressive dependence, point estimation, unbiasedness, efficiency. Again, a short summary of these can be found in 19. Among the most important points see: from 19.43 to 19.47 and from 19.91 to 19.104.

6 Matrix algebra

I suppose the Reader knows what a matrix and a vector are, the basic rules for multiplication between matrices and between matrices and scalars, and those for sums of matrices. I also suppose the Reader knows the meaning and basic properties of a matrix inverse and of a quadratic form.
This very short section only recalls a small number of matrix results and presents a very useful result called the "spectral decomposition" or "eigendecomposition" theorem. Moreover, we consider some differentiation rules. In what follows I'll write sums and products without declaring matrix dimensions: I'll always suppose the matrices to have the correct dimensions. The inverse of a square matrix A is indicated by A^{-1}, with A^{-1}A = I = AA^{-1}. A property of the inverse is that, if A and B have inverses, then (AB)^{-1} = B^{-1}A^{-1}. Notice that, if A and B are square invertible matrices and AB = I, then, since (AB)^{-1} = I = B^{-1}A^{-1}, by multiplying on the left by B and on the right by A we have BB^{-1}A^{-1}A = BA = I. The rank of a matrix A (no matter if square or not), Rank(A), is the maximum number of linearly independent rows or columns in A. Put in a different way, the rank of a matrix is the order of the biggest (square) matrix that can be obtained from A by deleting rows and/or columns and whose determinant is not zero. Obviously, then, the rank of a matrix cannot be bigger than its smaller dimension. A fundamental property of the rank of a product is this: Rank(AB) ≤ min(Rank(A); Rank(B)). If B is a q × k matrix of rank q then Rank(AB) = Rank(A). Analogously, if A is an h × q matrix of rank q then Rank(AB) = Rank(B). Applying this we have that Rank(AA') = Rank(A). A symmetric matrix A is called positive semi definite (psd) iff for any column vector x we have x'Ax ≥ 0. If the inequality is strict (>) for all vectors x not identically null, then the matrix A is called "positive definite" (pd). Often we must compute derivatives of functions of the kind x'Ax (a quadratic form) or x'q (a linear combination of the elements in the vector q with weights x) with respect to the vector x. In both cases we are considering a (column) vector of derivatives of a scalar function w.r.t. a (column) vector of variables (commonly called a "gradient").
There is a useful matrix notation for such derivatives which, in these two cases, is simply given by:

∂(x'Ax)/∂x = 2Ax

and

∂(x'q)/∂x = q

The proof of these two formulas is quite simple. We give a proof for the generic element k of the derivative column vector. Since

x'Ax = Σ_i Σ_j x_i x_j a_{i,j}

we have

∂(Σ_i Σ_j x_i x_j a_{i,j})/∂x_k = Σ_{j≠k} x_j a_{k,j} + Σ_{i≠k} x_i a_{i,k} + 2x_k a_{k,k} = Σ_{j≠k} x_j a_{k,j} + Σ_{j≠k} x_j a_{k,j} + 2x_k a_{k,k} = 2A_{k,·} x

where A_{k,·} means the k-th row of A and we used the fact that A is a symmetric matrix. Moreover

x'q = Σ_j x_j q_j

so that

∂(x'q)/∂x_k = q_k

An important point to stress is that the derivative of a function with respect to a vector always has the same dimension as that vector, so, for instance (remember that A is symmetric):

∂(x'Ax)/∂x' = 2x'A

A multi purpose fundamental result in matrix algebra is the so called "spectral theorem":

Theorem 6.1. If A is an (n × n) symmetric pd matrix then there exist an (n × n) orthonormal matrix X and a diagonal matrix Λ such that:

A = XΛX' = Σ_j λ_j x_j x_j'

where x_j is the j-th column of X and is called the j-th eigenvector of A; the elements λ_j on the diagonal of Λ are called the eigenvalues of A. These are positive, if A is pd, and can always be arranged (rearranging also the corresponding columns of X) in non increasing order.

The formula A = XΛX' is called the "spectral decomposition" of A. If the matrix A is only psd with rank m < n, a similar theorem holds but the matrix X has only m columns and the matrix Λ is a square m × m matrix. Notice that the spectral theorem implies that the rank of a p(s)d matrix A is equal to the number of non null eigenvalues of A. A property of the eigenvectors of a pd matrix A is that XX' = I (and, since in the pd case X is square, we also have X'X = I). That is: the eigenvectors are orthonormal. A nice result directly implied by this theorem when A is pd is this: A^{-1} = XΛ^{-1}X'. In fact A^{-1} = (XΛX')^{-1} = X'^{-1}Λ^{-1}X^{-1} = XΛ^{-1}X'.
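Both the decomposition and the gradient rule can be checked numerically. Below is a small NumPy sketch (the matrix and every name are invented for the example); note that `numpy.linalg.eigh` returns the eigenvalues in ascending rather than non increasing order:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)     # symmetric positive definite by construction

lam, X = np.linalg.eigh(A)      # eigenvalues (ascending) and orthonormal eigenvectors

# Spectral decomposition: A = X Lambda X'
assert np.allclose(A, X @ np.diag(lam) @ X.T)
# The eigenvectors are orthonormal: X'X = I
assert np.allclose(X.T @ X, np.eye(4))
# Inverse from the decomposition: A^{-1} = X Lambda^{-1} X'
assert np.allclose(np.linalg.inv(A), X @ np.diag(1 / lam) @ X.T)

# Gradient rule check: d(x'Ax)/dx = 2Ax, via a central finite difference
x = rng.normal(size=4)
h = 1e-6
e = np.eye(4)
num_grad = np.array([((x + h*e[i]) @ A @ (x + h*e[i])
                      - (x - h*e[i]) @ A @ (x - h*e[i])) / (2*h) for i in range(4)])
assert np.allclose(num_grad, 2 * A @ x)
```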
The Reader must notice that computing the spectral decomposition of a matrix, while straightforward from the numerical point of view, is by no means easy to do by hand. In order to understand this, we can post multiply A by the generic x_i:

Ax_i = XΛX'x_i = Σ_j λ_j x_j x_j' x_i

Each term in the sum is going to be equal to 0 (the x_j vectors are orthonormal) except the i-th, which is going to be equal to x_i λ_i, so that x_i solves the equation (A − λ_i I)x_i = 0. Notice that x_i'x_i = 1, so that the "trivial" solution x_i = 0 is NOT a solution of this problem. Hence, any feasible solution requires |A − λ_i I| = 0, so that we see that the λ_i's are the roots of the equation |A − λI| = 0 (the so called "characteristic equation" for A). If this determinant equation is written down in full, it shall be seen that it is a polynomial equation in the variable λ of degree equal to the rank k of A (its size, if A is of full rank). As is well known since high school days, polynomial equations have k real and/or complex roots (the so called "fundamental theorem of algebra"). However, an explicit formula for finding such roots only exists (in the general case) for k ≤ 4. On the other hand, finding the roots of this equation is so relevant a problem in applied Mathematics that numerical algorithms for computing them exist at least from the times of Newton.

The representation A = Σ_j λ_j x_j x_j' makes obvious many classic matrix algebra results. For instance, we know that Az = 0 may have nontrivial solutions only if A is non invertible. In the case of a symmetric psd matrix this implies that the number of positive eigenvalues is smaller than the size of the matrix. In this case, writing

Az = Σ_j λ_j x_j x_j' z = 0

immediately shows that the solution(s) to the homogeneous linear system must be found among those (non null) vectors z which are orthogonal to each eigenvector x_j. Such vectors, obviously, cannot exist if A is invertible.30
A last useful result is the so called "matrix inversion lemma":

(A − BD^{-1}C)^{-1} = A^{-1} − A^{-1}B(CA^{-1}B − D)^{-1}CA^{-1}

7 Matrix algebra and Statistics

We use both random matrices and random vectors. A random matrix is simply a matrix of random variables, and the same holds for a random vector. The expected value of a random matrix or vector Q is simply the matrix or vector of the expected values of each variable in the matrix or vector and is indicated as E(Q).

30 For the lovers of formal language: k orthonormal vectors of size k (k-vectors) "span" a k-dimensional space, in the sense that any vector in the space can be written as a linear combination of such orthonormal vectors. For this reason, the only k-vector orthogonal to all k orthonormal vectors (which means that the vector is not a linear combination of them) is the null vector. On the other hand, given q < k orthonormal k-vectors, these span a q-dimensional subspace and there exist other k − q orthonormal k-vectors which are orthogonal to the first q and span the "orthogonal complement" of the space spanned by the q vectors. This is simply the space of all k-vectors which cannot be written as linear combinations of the q vectors, equivalently: the space of all vectors which are orthogonal to the q vectors. You see how a k-dimensional pd matrix implicitly defines a full orthonormal basis for a k-dimensional space. Moreover, the knowledge of its eigenvectors allows us to split this space into orthogonal subspaces.

For a random (column) vector z we define the variance covariance matrix, indicated with V(z) but sometimes with Cov(z) or C(z), as:

V(z) = E(zz') − E(z)E(z')

For the expected value of a matrix or a vector we have a result which generalizes the linearity property of the scalar expected value. Let A1 and A2 be random matrices (of any dimension, including vectors) and B, C, D, G, F non random matrices.
We have:

E(BA1C + DA2G + F) = BE(A1)C + DE(A2)G + F

where, as anticipated, we suppose that all the products and sums have meaning, that is: the dimensions are correct and the expected values exist. The covariance matrix has a very important property which generalizes the well known result about the variance of a sum of random variables. Let z be a random (column) vector, H a non random matrix and L a non random vector; then:

V(Hz + L) = HV(z)H'

Suppose for instance that z is a 2 × 1 vector and H has a single row made of ones: in this case the above result yields the usual formula for the variance of the sum of 2 random variables.

7.1 Risk budgeting

This result has several applications in Finance. For example, suppose H is given by a single row, that is: we are considering a single linear combination of the random variables in z. In this case:

V(Hz) = HV(z)H' = Σ_j Σ_i H_j H_i V(z)_{i,j} = Σ_j H_j C_j

where C_j = Σ_i H_i V(z)_{i,j} (the indexes i, j run over the dimensions of V(z)). In words: C_j is the linear combination with weights H_i of all the covariances between z_j and all the z_i (one of these is the covariance of z_j with itself, that is, its variance). If we interpret z as a vector of (arithmetic) returns and H as a vector of fixed portfolio weights (for instance the weights for a one period buy and hold portfolio), the above result expresses the variance of the portfolio as a linear combination of (not necessarily positive) contributions due to each security return. Each contribution is the linear combination with weights H_i of all the covariances of the return z_j with all returns or, in other words, the covariance of the return z_j with the return Hz of the portfolio. Measures like this are commonly computed by portfolio managers as tools for measuring the risk contribution of each security in a portfolio, a practice called "risk budgeting".

7.2 A varcov matrix is at least psd

An important property of a covariance matrix is that it is at least psd.
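A minimal numerical sketch of this decomposition (the covariance matrix and the portfolio weights below are invented for the example):

```python
import numpy as np

# Illustrative covariance matrix of three security returns and buy-and-hold weights
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
H = np.array([0.5, 0.3, 0.2])

C = V @ H                # C_j = cov(z_j, Hz): covariance of each return with the portfolio
contrib = H * C          # H_j C_j: risk contribution of security j
port_var = H @ V @ H     # variance of the portfolio return
```

The individual contributions H_j C_j need not be positive, but by construction they add up exactly to the portfolio variance.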
In fact, if A is the covariance matrix, say, of the vector z, then the quantity x'Ax is simply the variance of the linear combination of random variables x'z, where x is a non random vector. But a variance cannot be negative, hence x'Ax ≥ 0 for all possible choices of x, and this means A is at least psd. Suppose now that A is the (suppose pd) covariance matrix of the random vector z. The spectral decomposition theorem tells us that we can write A as A = XΛX', with X orthogonal (i.e. XX' = X'X = I) and Λ diagonal (Λ = diag(λ1, ..., λn)). The columns of X and the diagonal elements of Λ are the eigenvectors and eigenvalues of A, respectively (i.e. Ax_i = λ_i x_i). Let p = X'z. Then p is a vector of non correlated random variables with covariance matrix Λ. In fact V(p) = X'V(z)X = X'AX = Λ. Moreover z = Xp, since Xp = XX'z = z. In other words, we can always write any random vector with pd covariance matrix as a set of linear combinations of uncorrelated random variables. Notice that E(p) = X'E(z) and E(z) = XE(p). When the covariance matrix of z is only psd, a similar result holds but the dimension of the vector p shall be equal to the rank of V(z) and so smaller than the dimension of z. This in particular implies that any pd matrix can be interpreted as a covariance matrix, that is: there exists a random vector of which the given matrix is the covariance matrix (this is also true for psd matrices, but the proof is slightly more involved). Recall that, if A is only psd, say of size k and rank q < k, then it has q eigenvectors x_l corresponding to positive eigenvalues. However, there exist k − q vectors z*_j such that z*_j'z*_j = 1, z*_j'z*_{j'} = 0 for j ≠ j' and z*_j'x_l = 0 for any vector x_l which is an eigenvector of A. Using any of these z*_j, or any non zero scalar multiple of them, it is then possible to build linear combinations of the random variables whose varcov matrix is A such that these linear combinations have zero variance.
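The decorrelation p = X'z can be illustrated numerically. In the sketch below (all numbers invented for the example) the rows of z are i.i.d. draws with covariance matrix A, so the sample covariance of p should be approximately diagonal:

```python
import numpy as np

rng = np.random.default_rng(5)
# Correlated data with a chosen pd covariance matrix A
A = np.array([[2.0, 0.8],
              [0.8, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], A, size=100_000)

lam, X = np.linalg.eigh(A)     # eigenvalues (ascending) and orthonormal eigenvectors
p = z @ X                      # p_i = X' z_i for each observation (stored as rows)
S = np.cov(p, rowvar=False)    # sample covariance of p: approximately diag(lam)
```

In population terms V(p) = X'AX = Λ exactly; the sample covariance only matches up to sampling error.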
If A is a variance covariance matrix of linear returns for financial securities, this implies that it is possible to build a portfolio31 of such securities and, possibly, the risk free rate, such that the single securities returns are random but the overall position return is non random. By no arbitrage, any such position, being risk free, must yield the risk free rate (otherwise you could borrow at the lower rate and invest at the higher rate for a sure, possibly unbounded, profit). This very important property is central in asset pricing theory and, more generally, in asset management.

31 Or, alternatively, a long short position.

(Compare this with the example regarding the constrained minimization of a quadratic form in the Appendix.) As we shall see in the section dedicated to factor models and principal components, it is often the case that covariance matrices of returns for large sets of securities are approximately not of full rank, that is: it may be that all eigenvalues are non zero but many of these are almost zero. In this case it is possible to build portfolios of risky securities whose return is "almost" riskless. This has important applications in (hedge) fund management and, more generally, in trading and asset pricing.

7.3 Note

These are the bare essential matrix results for this course. Many more useful results of matrix algebra exist, both in general and applied to Statistics and Econometrics. For the interested student the Internet offers a number of useful resources. We limit ourselves to quoting a "matrix cookbook" you can download from the Internet: its title is "The Matrix Cookbook".32

Examples
Exercise 6-Matrix Algebra.xls

8 The deFinetti, Markowitz and Roy model for asset allocation

This section on the deFinetti-Markowitz-Roy model33 is here both as an exercise in matrix calculus and because it shall be useful to us in what follows.
Suppose an investor is considering a buy and hold portfolio strategy for a single period of any fixed length (the length is not a decision variable in the asset allocation) from time t to time T. Let us indicate with R the random vector of linear total returns from a given set of (k) stocks for the time period. We suppose the investor knows (not "estimates") both µR = E(R) and Σ = V(R). We suppose Σ to be invertible. Moreover, there exists a security which can be bought at t and whose price at T is known at t (typically a non defaultable bond), called the "risk free security". Let rf be the (linear) non random return from this investment over the period.

32 A possible link is http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf: this worked when I last checked it in August 2019, but I cannot guarantee the stability of the link.
33 See appendix 15 for a summary of the story of this result.

The fund manager's strategy is to invest in the risk free security and in the stocks at time t and then liquidate the investment at time T. The relative amounts of each investment in stocks are in the column vector w, while 1 − w'1 is the relative amount invested in the risk free security. The return of the portfolio between t and T is:

RΠ = (1 − w'1)rf + w'R

So that the expected value and the variance of the portfolio return are

E(RΠ) = (1 − w'1)rf + w'µR

and

V(RΠ) = w'Σw

The problem for the fund manager is to choose w so that, for a given expected value c of the portfolio return, the variance of the portfolio return is minimized. In formulas:

min_{w: (1−w'1)rf + w'µR = c} w'Σw

Equivalently, the fund manager could fix the variance and choose w such that the expected return is maximized. In both problems it would be sensible to use an inequality constraint. For instance, in the first problem, we could look for

min_{w: (1−w'1)rf + w'µR ≥ c} w'Σw

We choose the = version just to allow direct use of Lagrange multipliers, as we'll see in what follows.
Notice that we do not assume the sum of the elements of w to be 1. This shall be true only if no risk free investment is made. However, obviously, if we complement the vector w with the fraction of portfolio invested in the risk free security, the sum of all the portfolio fractions is 1. Moreover we do not require each element of w to be positive. This can be done, but not in the straightforward way we are going to follow. Before going on with the solution of our problem, it is proper to discuss an interesting property of the mean variance criterion. The mean variance criterion may seem sensible and, actually, it usually is sensible. However it is easy to build examples where the results are counter intuitive.34

34 Suppose only two possible scenarios exist, both with probability 1/2. You are choosing between two securities: A and B. In the first state of the world the return of both securities is 0; in the second it is 1 for the first and 4 for the second. So the expected returns are .5 and 2 and the variances .25 and 4. Suppose now you wish to minimize the variance for getting at least an expected return of .5. Both investments yield at least that expected return; however, according to the mean variance criterion, you would choose the first, as the variance of the second is bigger. But the second is going to yield you a return greater than or equal to that of the first in both states of the world: in other words, you are trading more for less. Notice that, since the two investments are perfectly correlated, you are going to invest in just one of them. The reason for the paradox is simple: you are considering "variance" as bad, but variance may be due both to bigger losses and to bigger gains. Since with the mean variance criterion you may end up trading more for less, we can conclude that, in general, the mean variance criterion does not satisfy no arbitrage.

In order to solve the problem we consider its Lagrangian function (notice the rearranged constraint):

L = w'Σw − 2λ(rf − c + w'(µR − 1rf))

and differentiate it with respect to the vector w and the scalar λ (remember the differentiation rules):

∂L/∂w = 2Σw − 2λ(µR − 1rf)

∂L/∂λ = −2(rf − c + w'(µR − 1rf))

We then define the system of "normal equations":

Σw − λ(µR − 1rf) = 0
rf − c + w'(µR − 1rf) = 0

It is easy to solve the first sub system for w as:

w = λΣ^{-1}(µR − 1rf)

At this point we already notice that the required solution is a scalar multiple (λ) of a vector which does not depend on c. In other words, the relative weights of the stocks in the portfolio are already known and do not depend on c. What is still not known is the relative weight of the portfolio of stocks with respect to the risk free security.

This is a first instance of a "separation theorem": the amount of expected return we want to achieve only influences the allocation between the risk free security and the stock portfolio but does not influence the allocation among different stocks (the optimal risky portfolio is uniquely determined). As a second comment we see that, had our objective been that of solving

max_w (1 − w'1)rf + w'µR − (1/(2λ)) w'Σw

that is: had we wished to maximize some "mean variance" utility criterion, our result would have been exactly the same. Since the (negative) weight of the variance in this criterion is given by −1/(2λ), as a rule 1/λ is termed the "risk aversion parameter".

The rest of this section is useful as an exercise in matrix algebra (and maybe in Finance) but is not required for the exam.

If we see λ as the Lagrange multiplier, it is possible, and useful, to express it in more detail. Substitute the solution for w in the constraint equation and find:

λ = (c − rf) / ((µR − 1rf)'Σ^{-1}(µR − 1rf))

For the weights w, we have:

w = (c − rf)Σ^{-1}(µR − 1rf) / ((µR − 1rf)'Σ^{-1}(µR − 1rf))

As an aid to intuition, consider what happens in the case where Σ is diagonal with elements σ²_{Ri}.
In this case the stock portfolio weights are proportional to (µ_{Ri} − rf)/σ²_{Ri}. It is quite useful to write this as λ = (c − rf)/q, where q = (µR − 1rf)'Σ^{-1}(µR − 1rf). Notice that q does not depend on c, that is: q is independent of which optimal portfolio (choice of expected value) you are building. In fact, if w is the solution to our problem, we have

q = V(w'R)/λ² = V(w'R/λ)

so that √q represents the standard deviation of the solution portfolio divided by λ or, in other words, the "risk" of our optimal portfolio expressed in "units of λ". Since c − rf = E(RΠ) − rf, we also have

λ = V(RΠ)/(E(RΠ) − rf)

so that, from q = (c − rf)/λ = (E(RΠ) − rf)²/V(RΠ), we have

√q = (E(RΠ) − rf)/√V(RΠ)

As we did see before, q is a constant identical for all efficient portfolios, that is: q = (µR − 1rf)'Σ^{-1}(µR − 1rf). This last equation implies that, in the expected value/standard deviation space, efficient portfolios' expected values and standard deviations are connected by a line whose slope is √q. In other words √q is both the so called "Sharpe ratio" and the so called "price of risk" of the efficient portfolio (we obviously suppose E(RΠ) − rf ≥ 0).35

35 It should be noticed, however, that this "price of risk" interpretation is a bit extraneous to the mean variance context and shall take its full weight in a CAPM context. But this is not a topic of our course.

Suppose now that we choose a c such that the corresponding w̃ satisfies 1'w̃ = 1, that is: no investment in the risk free security. In other words we are considering the so called tangency portfolio. Since in the CAPM model the tangency portfolio becomes the market portfolio, we call the corresponding c = E(RM).
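As a numerical illustration (the expected returns, covariance matrix, risk free rate and target c below are all invented for the example), the optimal weights and the Sharpe ratio √q can be computed as:

```python
import numpy as np

# Illustrative inputs: expected returns, covariance matrix, risk free rate, target c
mu = np.array([0.08, 0.10, 0.12])
Sigma = np.array([[0.04, 0.01, 0.01],
                  [0.01, 0.09, 0.03],
                  [0.01, 0.03, 0.16]])
rf, c = 0.02, 0.07

excess = mu - rf
q = excess @ np.linalg.solve(Sigma, excess)   # q = (mu - 1 rf)' Sigma^{-1} (mu - 1 rf)
lam = (c - rf) / q                            # Lagrange multiplier
w = lam * np.linalg.solve(Sigma, excess)      # optimal risky weights

# Expected portfolio return (the remainder 1 - sum(w) is in the risk free security)
ER = (1 - w.sum()) * rf + w @ mu
sharpe = (ER - rf) / np.sqrt(w @ Sigma @ w)   # should equal sqrt(q)
```

Changing c only rescales w through λ, which is the separation theorem seen numerically.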
In this case we have:

1'w̃ = (E(RM) − rf) 1'Σ^{-1}(µR − 1rf)/q = 1

so that

q = (E(RM) − rf) 1'Σ^{-1}(µR − 1rf)

or, in other terms:

w̃ = Σ^{-1}(µR − 1rf) / (1'Σ^{-1}(µR − 1rf))

The tangency portfolio is the portfolio where a line starting at the risk free rate, in the expected value/standard deviation plane, is tangent to the efficient frontier. Notice that the weights of the (relevant, that is: possibly tangent) efficient frontier portfolios are simply given by w̃ changing rf. Due to the separation theorem, we can express the return of any efficient portfolio (RΠ) as the return of a portfolio invested in the risk free rate and in the tangency portfolio:

RΠ = (1 − γ)rf + γw̃'R

where:

γ = (c − rf) 1'Σ^{-1}(µR − 1rf)/q

Remember that c = E(RΠ), so that:

γ = (E(RΠ) − rf) 1'Σ^{-1}(µR − 1rf)/q

A comment on Markowitz and the CAPM. The Markowitz model is not the CAPM: it only gives us all mean variance efficient combinations of a risk free security and a given set of risky securities with given expected value vector and covariance matrix. For a given set of risky securities and a given risk free rate, all efficient portfolios have the same Sharpe ratio, as they are different linear combinations of the risk free security and of the same risky portfolio. However, if we change the set of securities, the efficient risky portfolio changes, in general, and the corresponding Sharpe ratio changes too. Suppose, though, that the set of risky securities contains ALL risky securities. It could be difficult, in this case, not to state that there exists one and only one (if the covariance matrix is invertible) optimal risky portfolio and, by consequence, only one possible Sharpe ratio for all possible (efficient) investments. Here we are very near to the CAPM, and we are also very near to understanding that the use of a Sharpe ratio for comparing securities is, perhaps, unwarranted (Sharpe would say something much stronger).
A last relevant observation is this: in this chapter we supposed expected values and covariances as given and known. What happens when (as is commonly the case) this is not true?

9 Linear regression

9.1 Weak OLS hypotheses

We begin with the so called "weak" OLS hypotheses. Let Y be an n × 1 random vector containing a sample of n observations on a "dependent" variable y. Let X be a non random n × k matrix of rank k, β a k × 1 non random vector and ε an n × 1 random vector. Let:

Y = Xβ + ε

and suppose E(ε) = 0 and V(ε) = σε² In. These hypotheses are best understood in a statistical (that is: estimation in repeated samples) setting. Each sample we are going to draw shall be given by a realization of Y. In each sample X (observable) and β (unobservable) shall be the same. What makes Y "random", that is: changing in a (partially) unpredictable (for us) way from sample to sample, is the random "innovation" or "error" vector ε which we cannot observe. As for the "partially" clause: under the assumed hypotheses it is clear that

E(Y) = E(Y|X) = Xβ

so that the expected value of the random Y is not random (while it is unknown, as β is unknown). In this sense we may say that we are modeling the regression function of Y on X, that is: the conditional expectation of Y given X. However, since the matrix X is non random, which means that the probability of observing that particular X is one or, equivalently as observed above, that in any sample X (and β) shall always be the same (so that Y is random just due to the effect of the random element ε; in fact V(Y) = V(Xβ + ε) = V(ε) = σε² In), the conditional expectation shall be the same as the unconditional expectation. A more interesting model, from the point of view of financial applications, shall be considered below, when we shall allow X to be itself random.

9.2 The OLS estimate

The problem is to estimate β.
Under the above hypotheses different estimation procedures all lead to the same estimate: the Ordinary Least Squares estimate β̂OLS. The simplest way to derive β̂OLS is through its namesake, that is: find the value β̂OLS that minimizes ε'ε, the sum of squared errors. The objective function shall be:

ε'ε = (Y − Xβ)'(Y − Xβ)

The first derivatives with respect to β are:

∂(Y − Xβ)'(Y − Xβ)/∂β = 2X'Xβ − 2X'Y

The system of normal36 equations (where we ask for the β that sets the first derivatives to zero) is:

X'Xβ = X'Y

Since the rank of X is k, X'X can be inverted and the (unique) solution of the system is:

β̂OLS = (X'X)^{-1}X'Y

As usual we do not check the second order conditions: we should! Informally, we see that we are minimizing a sum of squares which may go to plus infinity, not to minus infinity, so that our stationary point should be a minimum, not a maximum (this is by no means rigorous, but it could be made so). It is easy to show that β̂OLS is unbiased for β. In fact:

E(β̂OLS) = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'Xβ = β

where in the first passage we use the fact that X is non random and in the second one we use the hypothesis that β is non random and that E(ε) = 0. It is also easy to compute V(β̂OLS):

V(β̂OLS) = (X'X)^{-1}X'V(Y)X(X'X)^{-1} = (X'X)^{-1}X'σε²InX(X'X)^{-1} = σε²(X'X)^{-1}X'X(X'X)^{-1} = σε²(X'X)^{-1}

36 Here, as in the Markowitz model, the term "normal" does not mean "usual" or "standard". As we are going to see in a moment, the first order conditions in systems like these require that each of a set of products between vectors be 0. This has to do with requiring that vectors be perpendicular, and the term "normal" derives from the Latin "normalis", meaning "done according to a carpenter's square ("norma" in Latin)". A carpenter's square, as is well known, is made of two rulers crossed at 90° (the triangular square of high school has an added side).
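A quick numerical sketch of the OLS formula on simulated data (every number and name below is an assumption made for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
# Design matrix with a constant column; X is held fixed across hypothetical samples
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(0.0, 0.1, n)      # innovations with E = 0, variance sigma^2
Y = X @ beta + eps

# beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically preferable equivalent: least squares via an orthogonal factorization
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the normal equations and calling a least-squares routine give the same answer here; in ill-conditioned problems the factorization route is the safer one.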
By extension, the term came to mean "done according to rules" (in effect a square is two rule(r)s...) and from this derives today's most common, non mathematical, usage of the word.

9.3 The Gauss Markoff theorem

All this notwithstanding, the choice of β̂OLS as an estimate of β based only on the minimization of ε'ε could be disputed: in which sense should this be a "good" estimate from a statistical point of view? A strong result in favor of β̂OLS as a good estimate of β is the Gauss-Markoff theorem.

Definition 9.1. If β* and β** are both unbiased estimates of β, we say that β* is not worse than β** iff D = V(β**) − V(β*) is at least a psd matrix.

Notice that in the univariate case this definition boils down to the standard definition. Moreover, suppose that what you really want is to estimate a set of linear functions of β, say Hβ, where H is any non random matrix (of the right size, so that it can pre multiply β). Suppose you know that β* is better than β** according to the previous definition. In this case it is easy to show that Hβ* is better than Hβ**, according to the same definition, as an estimate of Hβ. In fact:

V(Hβ**) − V(Hβ*) = H(V(β**) − V(β*))H' = HDH'

and if D is at least psd then HDH' is at least psd (why?). This "invariance to linear transforms" property is the strongest argument in favor of this definition of "not worse" estimator. As an exercise, find a proof of the fact that on the diagonal of D all the elements must be non negative. In other terms: all the variances of the * estimate are not bigger than the variances of the ** estimate. Obviously this definition of "not worse" would amount to little if the estimates were allowed to be biased: in this case any vector of constants would be not worse than any other estimate, in terms of variance. For this reason we then require unbiasedness.37 We are now just a step short of being able to state an important result in OLS theory.
We would like to prove a theorem of the kind: the best unbiased estimate of β is β̂_OLS. Alas, this is actually not true in this generality. The theorem turns out to be true if we further require the class of competing estimates to be linear in Y, that is: each competing estimate must be of the form HY with H a known nonrandom matrix.

Definition 9.2. We say that β̂ is a linear estimate of β iff β̂ = HY where H is a non random matrix.

We thus arrive at the celebrated Gauss-Markoff theorem³⁸.

Theorem 9.3. Under the weak OLS hypotheses β̂_OLS is the Best Linear Unbiased Estimate (BLUE) of β.

Footnote 37: As an alternative we could use the concept of mean square error matrix in place of the variance covariance matrix.

Footnote 38: See 16 for some details about the history of this theorem.

Proof. Any linear estimate of β can be written as β̂ = ((X′X)⁻¹X′ + C)Y with an arbitrary C. Since the estimate must be unbiased we have:

E(β̂) = ((X′X)⁻¹X′ + C)Xβ = β + CXβ = β

and this is possible only if CX = 0. Let us now compute V(β̂):

V(β̂) = σ²((X′X)⁻¹X′ + C)((X′X)⁻¹X′ + C)′ = σ²((X′X)⁻¹ + CC′ + (X′X)⁻¹X′C′ + CX(X′X)⁻¹)

but, since CX = 0, the last two terms in the above expression are both equal to 0. In the end we have:

V(β̂) = σ²((X′X)⁻¹ + CC′)

and this is V(β̂_OLS) plus σ²CC′, which is an at least psd matrix (why?). We have shown that the covariance matrix of any linear unbiased estimate of β can be written as the covariance matrix of β̂_OLS plus a matrix which is at least psd. To summarize: we have shown that β̂_OLS is BLUE.

As we shall see in the "stochastic X" section, the Gauss-Markoff theorem still holds under suitable modifications of the weak OLS hypotheses in the case of stochastic X.

There is an analogous theorem which is valid when V(ε) = Σ with Σ any non random (pd) matrix known up to a multiplicative constant.
In this case the BLUE estimate is

β̂_GLS = (X′Σ⁻¹X)⁻¹X′Σ⁻¹Y

where GLS stands for Generalized Least Squares, but we are not going to use this in this course. Notice that if Σ is not known up to a multiplicative constant the above is not an estimate.

The proof begins by recalling that any pd matrix has a pd inverse. Moreover any pd matrix A can be written as A = PP′ with P invertible. We then have Σ = PP′ and Σ⁻¹ = (PP′)⁻¹ = P′⁻¹P⁻¹. Now, multiply the model Y = Xβ + ε by P⁻¹:

P⁻¹Y = P⁻¹Xβ + P⁻¹ε

Call Y* = P⁻¹Y, X* = P⁻¹X and ε* = P⁻¹ε. We have that Y* = X*β + ε* satisfies the weak OLS hypotheses and, in particular,

V(ε*) = V(P⁻¹ε) = P⁻¹ΣP′⁻¹ = P⁻¹PP′P′⁻¹ = I.

We can then follow the standard proof up to the result: (X*′X*)⁻¹X*′Y* is BLUE. In the original data this is equal to

(X′P′⁻¹P⁻¹X)⁻¹X′P′⁻¹P⁻¹Y = (X′Σ⁻¹X)⁻¹X′Σ⁻¹Y = β̂_GLS

where GLS stands for Generalized Least Squares.

The result seems very general and, in a sense, it is. However we should take into account that the above proof allows Σ to be non diagonal but requires it to be KNOWN: otherwise we could not compute P and the estimate. Most cases of a linear model with correlated residuals do not allow for a "known" Σ, and the GLS estimate cannot be used directly. Most estimates used in practice in these cases can be seen as versions of the GLS estimate where Σ is itself "estimated" in some way. We do not consider this (interesting and very relevant) topic in this introductory course.

The above proof deserves some further consideration. Under the standard weak OLS hypotheses, both with non stochastic and with stochastic X, the OLS estimate works fantastically well: it minimizes the sum of squares of the errors, it is unbiased, it is BLUE. This is perhaps too much, and we should surmise that some of this bonanza strictly depends on a clever choice of the hypotheses. This is exactly the case.
We have just proved that, even in the case of non stochastic X, when the covariance matrix of the residuals is NOT σ²I, the BLUE estimate is NOT the OLS estimate. In this case "minimizing the sum of squared errors" is not equivalent to finding the "best" estimate in the Gauss-Markoff sense.

9.4 Fit and errors of fit

We call Ŷ = Xβ̂_OLS the "fitted" values of Y. In fact Ŷ is to be understood as an estimate of E(Y) = Xβ. On the other hand ε̂ = Y − Ŷ bears the name of "errors of fit". Notice that, by the first order conditions of least squares, we have:

X′Xβ̂_OLS = X′Y
X′(Xβ̂_OLS − Y) = 0
X′ε̂ = 0

This in particular implies

β̂′_OLS X′ε̂ = Ŷ′ε̂ = 0

This result is independent of the OLS hypotheses and depends only on the fact that β̂_OLS minimizes the sum of squared errors.

9.5 R²

A useful consequence of this result, joint with the assumption that the first column of X is a column of ones, allows us to define an index of "goodness of fit" (read: how much did I minimize the squares?) called R². In fact:

Y′Y = (Ŷ + ε̂)′(Ŷ + ε̂) = Ŷ′Ŷ + ε̂′ε̂

where the last equality comes from the fact that Ŷ′ε̂ = 0.

Moreover, if X contains as first column a column of ones, X′ε̂ = 0 implies that the sum, hence the arithmetic average, of the vector ε̂ is equal to zero. So the average of Y equals the average of Ŷ:

Ȳ = mean(Ŷ) + mean(ε̂) = mean(Ŷ)

and:

Y′Y − nȲ² = Ŷ′Ŷ − n mean(Ŷ)² + ε̂′ε̂

where n is the length of the vectors (number of observations). In other words, indicating with Var(Y) the numerical variance of the vector Y (that is: the mean of the squares minus the squared mean), we have:

Var(Y) = Var(Ŷ) + Var(ε̂)

We see that the variance of Y neatly decomposes into two non negative parts. There is no covariance! This is totally peculiar to the use of least squares and implies the definition of a very natural measure of "goodness of fit":

R² = Var(Ŷ)/Var(Y) = 1 − Var(ε̂)/Var(Y)

Notice that, in order to be meaningful, R² requires the presence of a column of ones (or of any constant, in fact) in X.
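The exact variance decomposition, and the equality of the two expressions for R², can be checked on simulated data (a sketch with illustrative numbers, numpy assumed):

```python
import numpy as np

# Simulated regression with an intercept: Var(Y) = Var(Y_hat) + Var(e_hat)
# holds exactly, so the two R^2 formulas coincide.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b
e_hat = Y - Y_hat

def nvar(v):
    # numerical variance: mean of the squares minus the squared mean
    return np.mean(v**2) - np.mean(v)**2

# No covariance term appears in the decomposition.
assert np.isclose(nvar(Y), nvar(Y_hat) + nvar(e_hat))
r2_a = nvar(Y_hat) / nvar(Y)
r2_b = 1.0 - nvar(e_hat) / nvar(Y)
assert np.isclose(r2_a, r2_b)
```

Dropping the column of ones breaks both assertions in general, which is exactly the point made in the text.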
Otherwise the mean of the errors of fit may be different from 0 and the passage from sums of squares to variances is no longer possible (the mean of Ŷ shall not, in general, be the same as the mean of Y).

Due to sampling error we may expect to observe a positive R² even when there is no "regression" between Y and X. We can also say something about the expected size of an R² when it should actually be 0. Suppose that, conditionally on an n × k matrix X of regressors, Y is a vector of n iid random variables of variance σ², so that the regression of Y on X should give an R² of 0. However, the expected value of the sampling variance of the elements of Y (that is: of the denominator of the R²) is σ², while we can show that the expected value of the sampling variance of the elements of the error of fit vector ε̂ is σ²(n−k)/n, so that the expected value of the sampling variance of Ŷ in the same regression (the numerator of the R²) is going to be σ²k/n (because Y = Ŷ + ε̂).

While we know that the expected value of a ratio is not the ratio of the expected values, the latter could still be a good approximation, so we can say that, in the case of no regression at all between Y and X (that is: theoretical value of R² equal to 0), the expected value of the R² is approximated by (σ²k/n)/σ² = k/n, and this number could be quite big if you use many variables and do not have many observations.

This simple fact should make us wary of using regression as an "exploratory" tool for finding the "most relevant variables" in a wide set of potential candidates. Such an attitude has been common, and has been successfully criticized, many times in the past and, today, is back in fashion within the "data mining" movement. "To be wary" does not mean "to utterly avoid": taken with care such procedures and, more in general, exploratory data analysis may be useful.

9.6 More properties of Ŷ and ε̂

Let us now study some properties of Ŷ and ε̂ which depend on the OLS hypotheses.
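The k/n approximation just discussed lends itself to a small Monte Carlo check (illustrative n, k and replication count are assumptions; numpy assumed). Since k/n only approximates the expected value of the ratio, expect the simulated average to land a little below 0.2 here, not exactly on it.

```python
import numpy as np

# Y is pure iid noise, unrelated to X, yet the average R^2 over many
# replications is close to (slightly below) k/n.
rng = np.random.default_rng(2)
n, k, reps = 40, 8, 2000

def nvar(v):
    return np.mean(v**2) - np.mean(v)**2

r2s = []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    Y = rng.normal(size=n)                     # no regression at all
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b
    r2s.append(1.0 - nvar(e) / nvar(Y))

print(np.mean(r2s))   # roughly k/n = 0.2 (in practice a bit below)
```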
First we compute the expected values and covariance matrices of both vectors.

E(ε̂) = E(Y − Xβ̂_OLS) = Xβ − Xβ = 0

V(ε̂) = V(Y − Xβ̂_OLS) = V(Y − X(X′X)⁻¹X′Y) = V((I − X(X′X)⁻¹X′)Y) = σ²(I − X(X′X)⁻¹X′)(I − X(X′X)⁻¹X′) = σ²(I − X(X′X)⁻¹X′)

this because, by direct computation, we see that

(I − X(X′X)⁻¹X′)(I − X(X′X)⁻¹X′) = (I − X(X′X)⁻¹X′)

that is: (I − X(X′X)⁻¹X′) is an idempotent matrix. With similar passages we find that:

E(Ŷ) = Xβ
V(Ŷ) = σ²X(X′X)⁻¹X′

In summary: we see that Ŷ is indeed an unbiased estimate of E(Y). On the other hand we see that ε̂ shows a non diagonal covariance matrix even if (or, better, just because) the vector ε is made of uncorrelated errors. This property of the estimated residuals is, in some sense, unsatisfactory and led some researchers to define a different estimate of the residuals (non OLS based) with the property of being uncorrelated under the OLS hypotheses. This different estimate, which we do not discuss here, is known in the literature as the BLUS residuals (where the ending S stands for "scalar", that is: with diagonal covariance matrix).

9.7 Strong OLS hypotheses and testing linear hypotheses in the linear model

This is a very short section for a very relevant topic. We are not going to deal with general testing of linear hypotheses but only with those tests which are routinely found in the output of standard OLS regression computer programs.

Let us begin by introducing the strong OLS hypotheses. In short, these are the same as the weak OLS hypotheses (with non stochastic X) plus the hypothesis that the vector ε is not only made of uncorrelated, zero expected value and constant variance random variables, but is also distributed as an n dimensional Gaussian with expectation vector made of zeros and diagonal variance covariance matrix³⁹. Why this hypothesis?
When we wish to test hypotheses we need to find the distributions of sample functions; for instance, we are going to need the distribution of β̂_OLS. Up to now we know that under the weak OLS hypotheses β̂_OLS has expected value vector β and variance covariance matrix σ²(X′X)⁻¹. Moreover we know that β̂_OLS = β + (X′X)⁻¹X′ε, that is: it is a linear function of ε (and β and X are non stochastic). With the added hypothesis we can then conclude that β̂_OLS has a Gaussian distribution with expected value vector β and variance covariance matrix σ²(X′X)⁻¹.

Footnote 39: A k dimensional random vector z̃ is distributed according to a k dimensional Gaussian density with expected value vector µ and variance covariance matrix Σ if and only if the density at any vector z of possible values for z̃ is given by

f(z; µ, Σ) = (2π)^(−k/2) |Σ|^(−1/2) exp(−(z − µ)′Σ⁻¹(z − µ)/2)

(As usual in the text we shall omit tildes for distinguishing between RVs and their values when this does not create confusion.) If we remember the properties of determinants and inverses of diagonal matrices, we see from this formula that, in the case of a diagonal covariance matrix, this density becomes a product of k one dimensional Gaussian densities (one for each element of the vector z̃). So, in the Gaussian case, non correlation and independence are the same. In fact, if Σ is a diagonal matrix with diagonal terms σᵢ², we have |Σ| = ∏ᵢ σᵢ² and Σ⁻¹ diagonal with diagonal terms 1/σᵢ², so that

f(z; µ, Σ) = (2π)^(−k/2) (∏ᵢ 1/σᵢ) exp(−(1/2) ∑ᵢ (zᵢ − µᵢ)²/σᵢ²) = ∏ᵢ (2πσᵢ²)^(−1/2) exp(−(zᵢ − µᵢ)²/(2σᵢ²)) = ∏ᵢ f(zᵢ; µᵢ, σᵢ²)

In words: with a diagonal covariance matrix the joint density is the product of the marginal densities, which is the definition of independence.
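The factorization claimed in the footnote can be verified numerically with standard library tools only (illustrative µ, σ and z values are assumptions):

```python
import math
from statistics import NormalDist

# With a diagonal covariance matrix, the joint Gaussian density at a point z
# equals the product of the univariate densities.
mu = [0.0, 1.0, -2.0]
sig = [1.0, 0.5, 2.0]          # standard deviations (diagonal Sigma has sig_i^2)
z = [0.3, 0.8, -1.5]

k = len(z)
det = math.prod(s**2 for s in sig)                              # |Sigma|
quad = sum((zi - mi)**2 / s**2 for zi, mi, s in zip(z, mu, sig))  # (z-mu)' Sigma^{-1} (z-mu)
joint = (2 * math.pi) ** (-k / 2) * det ** (-0.5) * math.exp(-quad / 2)

product = math.prod(NormalDist(mi, s).pdf(zi) for zi, mi, s in zip(z, mu, sig))
assert math.isclose(joint, product)
```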
An important property of a k dimensional Gaussian distribution is that, if A and B are non stochastic matrices (of dimensions such that A + Bz̃ is meaningful), then the distribution of A + Bz̃ is Gaussian with expected value vector A + Bµ and variance covariance matrix BΣB′. Linear transforms of Gaussian random vectors are Gaussian random vectors. This, for instance, implies that if z̃ is a Gaussian random vector, then each z̃ᵢ is Gaussian, as we stated a moment ago in the proof of the equivalence between non correlation and independence for the Gaussian distribution. This is easy to see: just write z̃ᵢ = 1ᵢ′z̃, where 1ᵢ′ is a k dimensional row vector with null elements except a 1 in the i-th place, and apply the linearity property.

This may seem too expedient: OK, computations are now simple, but why Gaussian errors? In fact it often is too expedient, and the pros and cons of the hypothesis could be (and are) discussed at length. For the moment we shall take it as a beginner's usage in the econometric world we live in, a usage to be taken with much care.

We suppose everybody knows what a Statistical Hypothesis is (see the Appendix in case). We now define a "linear" statistical hypothesis. A linear hypothesis on β can be written as Rβ = c, where R is a matrix of known constants and c a vector of known constants. For the purpose of this summary we shall concentrate on two particular R: a 1 × k vector R where only the j-th element is 1 and the others are zeros, and a (k − 1) × k matrix where the first column is made of zeros and the remaining (k − 1) × (k − 1) square matrix is an identity. In both cases c is made of zeros (in the first case a single 0 and in the second a k − 1 vector of zeros). The first kind of hypothesis is simply that the j-th β is zero (while all other parameters are free to take any value); the second kind of hypothesis is simply that all parameters are jointly zero (with the possible exception of the intercept).
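The two R matrices just described can be written down explicitly (a sketch for an assumed k = 4, intercept first; numpy assumed):

```python
import numpy as np

# Linear hypotheses are written as R beta = c.
k = 4

# (a) H0: beta_j = 0 for a single j (here j = 2, counting from 0)
j = 2
R_single = np.zeros((1, k))
R_single[0, j] = 1.0
c_single = np.zeros(1)

# (b) H0: all betas except the intercept are jointly 0
R_joint = np.hstack([np.zeros((k - 1, 1)), np.eye(k - 1)])
c_joint = np.zeros(k - 1)

# A beta with free intercept and zero slopes satisfies both nulls.
beta = np.array([0.7, 0.0, 0.0, 0.0])
assert np.allclose(R_single @ beta, c_single)
assert np.allclose(R_joint @ beta, c_joint)
```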
For (non trivial) historical reasons these hypotheses are considered so frequently relevant that any program for OLS regression tests them. Whether these hypotheses are of interest to you is for you to evaluate.

I shall not detail the procedure for the test of the hypothesis that all parameters, except possibly the intercept, are jointly equal to zero. I only mention the fact that the result of this test is usually displayed in any OLS program output. The name of this test is the F test.

A little more detail on the univariate test. The standard test for the hypothesis H0: βⱼ = 0 against H1: βⱼ ≠ 0 (complete the hypotheses by yourself) requires the distribution of β̂_OLS, and this, as we wrote above, requires a strengthening of the OLS hypotheses which takes the form of the assumption that ε is distributed according to an n-variate Gaussian: ε ~ Nₙ(0, σ²Iₙ). We do not discuss here the reasons for and against this hypothesis. Under this hypothesis, as seen above, we can show that:

β̂_OLS ~ N_k(β, σ²(X′X)⁻¹).

Hence the ratio:

(β̂ⱼ − βⱼ) / √(σ²{(X′X)⁻¹}ⱼⱼ)

(I drop the subscript OLS from the estimate in order to avoid double subscript problems) is distributed according to a standard Gaussian (i.e. N₁(0, 1)). Suppose now we set βⱼ = 0 in the above ratio. In this case the distribution of the ratio shall be a standard Gaussian only if H0: βⱼ = 0 is true. This allows us to define a reasonable rejection region for our test. Reject H0: βⱼ = 0 with a size of the error of the first kind equal to α iff:

β̂ⱼ / √(σ²{(X′X)⁻¹}ⱼⱼ) ∉ [−Z₁₋α/₂; +Z₁₋α/₂]

(which can be written in many equivalent forms). In the above formula {(X′X)⁻¹}ⱼⱼ is the j-th element on the diagonal of (X′X)⁻¹ and Z₁₋α/₂ is the quantile of the standard Gaussian which leaves on its left a probability of 1 − α/2. This solves the problem if σ² is known.
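The known-variance rejection rule can be sketched with standard library tools (the numeric inputs below are illustrative assumptions, not data from the handout):

```python
from statistics import NormalDist

# Reject H0: beta_j = 0 iff the standardized ratio falls outside
# [-Z_{1-alpha/2}, +Z_{1-alpha/2}], sigma^2 assumed known.
def reject_known_sigma(beta_j_hat, sigma2, xtx_inv_jj, alpha=0.05):
    ratio = beta_j_hat / (sigma2 * xtx_inv_jj) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return abs(ratio) > z

print(reject_known_sigma(0.8, 1.0, 0.04))   # ratio = 4 -> True
print(reject_known_sigma(0.2, 1.0, 0.04))   # ratio = 1 -> False
```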
In the case it is unknown, estimate it with:

σ̂² = ε̂′ε̂ / (n − k)

and use as critical region:

β̂ⱼ / √(σ̂²{(X′X)⁻¹}ⱼⱼ) ∉ [−t₁₋α/₂,ₙ₋ₖ; +t₁₋α/₂,ₙ₋ₖ]

where t₁₋α/₂,ₙ₋ₖ is the quantile of a T distribution with n − k degrees of freedom which leaves on its left a probability equal to 1 − α/2. The use of the T distribution is the reason for the name given to this test: the T test.

Another test whose results are as a rule reported in all standard outputs of a regression package is the F test. As in the case of the T test there exist many different hypotheses which can be tested using an F test, but the standard hypothesis tested by the universally reported F test is:

H0: all the betas corresponding to non constant regressors are jointly equal to 0
H1: at least one of the above mentioned betas is not 0

The idea is that, if the null is accepted (i.e. big P-value), no "regression" exists (provided we did not make an error of the second kind, obviously). Hence the popularity of the test.

9.8 "Forecasts"

Here we use the term "forecast" in a very restricted meaning. Suppose you estimated β using a sample of Y and X, say n observations (rows). Now suppose a new set of m rows of X is given to you and you are asked to assess what the corresponding Y could be. Stated like this, the question does not allow for an answer: we need to assume some connection between the old rows and the new rows of data. A possibility is as follows. Let the model for the n rows of data used to estimate β with β̂_OLS be

Y = Xβ + ε

Suppose we now have data for m more rows of the variables in X, and call these X_f. Let the model for the corresponding new "potential" observations be

Y_f = X_f β + ε_f

And suppose (we consider here the general case where X can be stochastic) that E(ε|X, X_f) = 0 = E(ε_f|X, X_f), V(ε|X, X_f) = σ²Iₙ, V(ε_f|X, X_f) = σ²Iₘ and E(ε ε_f′|X, X_f) = 0.
(Notice the double conditioning on both X and X_f.) In this case the obvious (BLUE) estimate for E(Y_f|X, X_f) is Ŷ_f = X_f β̂, with expected value X_f β and variance covariance matrix σ²X_f(X′X)⁻¹X_f′. If we define the "point forecast error" as Y_f − Ŷ_f, its expected value shall be 0 and its variance covariance matrix σ²(Iₘ + X_f(X′X)⁻¹X_f′). Be very careful not to mistake these formulas for the corresponding ones for Ŷ.

On the basis of these formulas, and working either under the strong OLS hypotheses or using Tchebicev, it is possible to derive (exact or approximate) confidence intervals for the estimate of the expected value of each element in the new set of observations and for the corresponding point forecast errors. For instance, under the Gaussian hypothesis, the two tails confidence interval (of level 1 − α) for the expected value of a single observation in the forecasting sample, under the hypothesis of a known error variance, corresponding to a row of values of X_f equal to x_f, is given by:

[x_f β̂_OLS ± z₁₋α/₂ σ √(x_f(X′X)⁻¹x_f′)]

The corresponding interval for the point forecast, that is, the forecast interval which takes into account the point forecast error, is:

[x_f β̂_OLS ± z₁₋α/₂ σ √(1 + x_f(X′X)⁻¹x_f′)]

In the case the error variance is not known and is estimated with the unbiased estimate σ̂² = ε̂′ε̂/(n − k) as described above, the only changes to be made in the formulas are: σ is substituted with its estimate σ̂ = √σ̂², and z₁₋α/₂ is substituted with t₁₋α/₂,ₙ₋ₖ, that is: the (1 − α/2) quantile of a T distribution having as degrees of freedom parameter the difference between n and k, the number of rows and of columns in X, the regressors matrix in the estimation sample.

It is easy to see that the second interval shall always be bigger than the first, as it takes into account not only the sampling uncertainty in estimating β but also the uncertainty added by ε_f.
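The two half-widths can be compared directly on a small simulated example (known σ; all numbers below are illustrative assumptions, numpy assumed):

```python
import numpy as np

# The forecast interval is always wider than the interval for the expected
# value, because of the extra "1 +" term coming from the new error eps_f.
rng = np.random.default_rng(3)
n, sigma, z = 30, 0.5, 1.96                 # z ~ Gaussian quantile for alpha = 5%
X = np.column_stack([np.ones(n), rng.normal(size=n)])
xtx_inv = np.linalg.inv(X.T @ X)
x_f = np.array([1.0, 0.7])                  # one new row of regressors

q = x_f @ xtx_inv @ x_f                     # x_f (X'X)^{-1} x_f'
half_mean = z * sigma * np.sqrt(q)          # half-width for E(Y_f | x_f)
half_point = z * sigma * np.sqrt(1 + q)     # half-width of the forecast interval
assert half_point > half_mean
```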
9.9 A note on P-values

The standard procedure for a test is:

• Choose H0 and H1 (they should be exclusive and exhaustive of the parameter space).
• Choose a size of the error of the first kind: α. Be careful: too small an α usually implies a big error of the second kind. Your choice should be based on a careful analysis of the costs implied by the two kinds of errors.
• Find a critical region for which the maximum size of the error of the first kind is α and, possibly, with a sensible size of the error of the second kind.
• Reject H0 if your sample falls in the critical region and accept it otherwise.

This procedure typically requires the availability of statistical tables. When computer programs for performing tests came to the fore, two alternative paths were possible. Let the user input the α for each test and, as output, point out whether the null hypothesis is accepted or rejected on the observed dataset with that α. Or let the user input nothing, but give the user as output the value of α for which the observed data would have been exactly on the brink of the rejection region. This value is called the P value. With this information the user, knowledgeable about his or her preferred α, would be able to state whether the null hypothesis was accepted or rejected by simply comparing the preferred α with the value given by the program. If the researcher's α is smaller than the value given by the program, the observed data is inside the acceptance region as it would have been computed by the researcher, so that H0 is accepted. If the researcher's α is bigger than the value given by the program, the observed data is outside the acceptance region as it would have been computed by the researcher, so that H0 is rejected.

Conceived for the simple purpose of avoiding the use of tables, the P value has become, in the hands of Statistics illiterates, the source of numberless misunderstandings and sometimes amusing bad behaviors.
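The comparison rule just described amounts to a one-line decision (the numbers are illustrative assumptions):

```python
# With a P value in hand, the researcher's own alpha decides the outcome,
# with no need for statistical tables.
def decision(p_value, alpha):
    return "reject H0" if alpha > p_value else "accept H0"

print(decision(0.03, 0.05))   # alpha bigger than the P value -> reject H0
print(decision(0.03, 0.01))   # alpha smaller than the P value -> accept H0
```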
The typical example is the use of terms like "highly significant" for the case of null hypotheses rejected with small P values, or the use of "stars" for indicating pictorially the "significance" of a hypothesis. A strange attitude is that of considering optimal, a posteriori, a small P value which, a priori, would never be considered a proper α value. Sometimes, worst of all, the P value is considered as an estimate of the probability of the null hypothesis being true given the data. A magnificent error, since in testing theory we only compute the probability of falling in the rejection region given the hypothesis, and not the probability of the hypothesis given that the data falls in the rejection region.

Please avoid this and other trivial errors: the fact that such errors are widespread, even in the scientific community, does not make them less wrong.

Exercise: under the hypotheses used for analyzing the t-test, find confidence intervals for single linear functions of β.

9.10 Stochastic X

In applications of the linear model to Economics and Finance we can only infrequently accept the hypothesis of a non random X matrix. Typically X shall be just as random (that is: unknown before observation and variable between samples) as Y. Just think of the CAPM "beta" regression, where the "dependent" variable is the excess return of a stock and the "independent" variable is the contemporaneous excess return of a market index which contains the same stock.

If X is random, the results we gave above about β̂_OLS are, in general, no longer valid. Under the hypothesis of a stochastic matrix X we can follow many ways of extending the OLS results. Each of these ways means adding an hypothesis to the standard weak OLS setting. Here I choose a very simple way which shall be enough for our purposes.
We shall extend the weak OLS hypotheses in this way:

E(ε|X) = 0
V(ε|X) = E(εε′|X) = σ²Iₙ

In other words, we shall assume that, conditionally on X, what we assumed unconditionally in the weak OLS setting holds. It is clear that our new hypotheses imply the old ones, not vice versa. Notice that the equality between the conditional covariance matrix and the conditional second moment matrix is true only because we assumed that the conditional expectation of ε is zero. (See by yourself what happens otherwise.)

An immediate result is that:

E(Y|X) = E(Xβ + ε|X) = Xβ

and this property justifies the name "linear regression" for our model. In other words, with a stochastic X and our added hypothesis, our model becomes a linear model for a conditional expectation: a regression function.

Let us now see whether the OLS estimate is still unbiased in the new setting.

E(β̂_OLS) = E(β + (X′X)⁻¹X′ε) = β + E_X(E_{ε|X}((X′X)⁻¹X′ε|X)) = β + E_X((X′X)⁻¹X′E_{ε|X}(ε|X)) = β

Notice the use of the typical trick: the iterated expectation rule. Now compute the variance covariance matrix.

V(β̂_OLS) = V(β + (X′X)⁻¹X′ε) = V((X′X)⁻¹X′ε) = E((X′X)⁻¹X′εε′X(X′X)⁻¹) − E((X′X)⁻¹X′ε)E(ε′X(X′X)⁻¹)

But the second term in the difference was just shown to be equal to 0, so:

= E((X′X)⁻¹X′εε′X(X′X)⁻¹) = E_X((X′X)⁻¹X′E_{ε|X}(εε′|X)X(X′X)⁻¹)

Now the term E_{ε|X}(εε′|X) is, by hypothesis, equal to σ²Iₙ, so that:

E_X((X′X)⁻¹X′E_{ε|X}(εε′|X)X(X′X)⁻¹) = σ²E_X((X′X)⁻¹X′X(X′X)⁻¹) = σ²E_X((X′X)⁻¹)

In short: with a stochastic X and the new OLS hypotheses, β̂_OLS is still unbiased, but now its covariance matrix is fully unknown, as it depends on the expected value of (X′X)⁻¹.

We conclude this section with just a hint at two results of standard non stochastic X OLS theory, one of which can, and the other of which cannot (in general), be proved to hold in the new setting. First: with a suitable modification of the definition of "best", a Gauss-Markoff theorem still holds.
Let us see how this works. In the case of stochastic X, the proof of the theorem (using the OLS hypotheses for the case of stochastic X) goes, now conditionally on X, exactly as before. The last statement then becomes

V(β̂|X) = σ²((X′X)⁻¹ + CC′)

We must then find V(β̂). In the general case this is NOT E(V(β̂|X)), as a second term depending on V(E(β̂|X)) should be considered. However, since E(β̂|X) = β (due to unbiasedness), this term is always equal to 0. We then have

V(β̂) = E(V(β̂|X)) = σ²(E((X′X)⁻¹) + E(CC′))

This is the varcov matrix of the OLS estimate when X is stochastic plus the expected value of an at least psd matrix. If the second term is at least psd too, we have our proof. This is easy: we must show that for any non stochastic vector z we have z′E(CC′)z ≥ 0. But, by the basic properties of the expected value operator, z′E(CC′)z = E(z′CC′z). We already know that CC′ is at least psd, which is equivalent to saying that, whatever z, we have z′CC′z ≥ 0. The expected value of a non negative number cannot be negative (why?) and we have our proof.

Second: the basic results underlying hypothesis testing do not directly hold in the new setting, except if we only work conditionally on X (which somewhat defeats the purpose of the new setting) and under the (strong) hypothesis that, conditionally on X, the distribution of ε is Gaussian. However, for all those tests whose distribution does not depend on X when X is non stochastic (as, for example, the T test), the assumption of conditional Gaussianity of ε given X implies that the standard result of the non stochastic X case still holds. More generally, hypothesis testing and the construction of confidence intervals can only be done conditionally on X, or by relying on asymptotic results whose applicability in specific settings is always very difficult to state.
For instance, the standard confidence interval for a generic βⱼ contains in its specification {(X′X)⁻¹}ⱼⱼ, which is stochastic if X is stochastic, so that it shall be useful only conditionally on X, for the simple reason that if you do not know X you do not know the extremes of the interval itself. (Notice, however, that since the probability of being in the interval does not depend on X, this probability shall still be 1 − α for ANY realization of X: the problem, if you work unconditionally on X, is not that you do not know the probability of being inside the interval, but that you do not know the extremes of the interval itself.)

9.11 Markowitz and the linear model (this section is not required for the exam)

There exists a nice connection between the OLS estimates and the Markowitz optimal weights. Let us recall Markowitz's formula, where µ_R is the row vector of expected returns and µ_ER the corresponding excess expected returns vector:

w = λΣ⁻¹(µ_R′ − r_f 1) = λΣ⁻¹µ_ER′

We are interested in deriving, using the OLS algorithm, a vector proportional to w or, at least, to the estimated w, where in the place of the unknown parameters we have the corresponding estimates. We can then divide each element of this vector by the sum of the elements and obtain weights summing to 1.

Let us call R the n row, k column matrix of data on excess returns and µ = 1′R/n the k column row vector containing the corresponding mean excess returns. Now let us start with the OLS normal equations:

R′Rβ̂_OLS = R′Y

R′R is not the estimate of Σ; however:

nΣ̂ = R′R − nµ′µ

Let us go back to the OLS normal equations and subtract nµ′µβ̂_OLS from both sides:

(R′R − nµ′µ)β̂_OLS = R′Y − nµ′µβ̂_OLS

That is:

nΣ̂β̂_OLS = R′Y − nµ′µβ̂_OLS

Now suppose Y = 1ₙ:

nΣ̂β̂_OLS = R′1ₙ − nµ′µβ̂_OLS = nµ′ − nµ′µβ̂_OLS = nµ′(1 − µβ̂_OLS)

where the 1 in (1 − µβ̂_OLS) is a scalar.
It is then clear that these normal equations are a scalar multiple of:

Σ̂β̂_OLS = µ′

Hence, the solution of these equations shall be a scalar multiple both of the solution of the normal equations for OLS with regressors R and dependent variable 1ₙ, and of the Markowitz weights equation. Since both are scalar multiples of the same vector, once normalized by dividing each term by the sum of its components we shall get the same normalized vector; in other terms:

β̂_OLS / (1′β̂_OLS) = w / (1′w)

We have just given a proof of the fact that the Markowitz weights are proportional to the OLS β estimate for the model:

1ₙ = Rβ + ε

That is: the Markowitz weights are (proportional to) the OLS solution of the problem "find the linear combination of the columns of R which best approximates a constant" (1 in our case, but any non zero constant will do as well). This should not be surprising, as the Markowitz weights minimize the variance of the risky portfolio.

9.12 Some results useful for the interpretation of estimated coefficients

9.12.1 Introduction

The proper use of regression, as of any statistical tool, requires a correct understanding of both its mathematical foundations and its meaning. These two aspects are connected but they are not equivalent. A rather long experience in teaching this topic has induced in the author of these handouts a strong belief in the fact that the "meaning" part, not the mathematical foundations part, is the difficult one. This section is dedicated to trying to at least partially deal with this point. The section is by far longer than all the other sections dedicated to linear regression put together. This is a clear hint that our objective is not trivial.

A word of warning is needed at the beginning of this (difficult) path: we shall try to shed some light on one important use of regression and, due to the limits of an introductory course, we shall only hint at another, maybe even more important, use.
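Before moving on, the equivalence derived in section 9.11 can be checked numerically (a sketch on simulated excess returns, not required for the exam; numpy and all numeric parameters are assumptions):

```python
import numpy as np

# Normalized OLS coefficients from regressing 1_n on R coincide with the
# normalized "estimated Markowitz" vector Sigma_hat^{-1} mu'.
rng = np.random.default_rng(4)
n, k = 500, 4
R = 0.01 + 0.05 * rng.normal(size=(n, k))    # n x k simulated excess returns
mu = R.mean(axis=0)                          # mean excess returns
Sigma_hat = R.T @ R / n - np.outer(mu, mu)   # nSigma_hat = R'R - n mu' mu

b_ols = np.linalg.solve(R.T @ R, R.T @ np.ones(n))   # regress 1_n on R
w = np.linalg.solve(Sigma_hat, mu)                    # proportional to Markowitz weights

assert np.allclose(b_ols / b_ols.sum(), w / w.sum())
```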
With some approximation we can say that a regression model is usually applied to two different problems: forecasting some variable on the basis of information about other variables, and forecasting the "effect" on some variable of the manipulation of other variables. The term "regression", as used in Probability and Statistics, has mostly to do with the first problem. The solution of the second problem, as we shall see, requires many assumptions which are extraneous to the sole fields of Probability and Statistics and deal with the core of the specific subject, in our case Economics and Finance, in which probabilistic and statistical statements are made. It is very important never to mix up the two purposes.

In what follows we shall dedicate a quite complete, if simple, analysis to the "forecast" use. We shall not enter into detail about, but only hint at, the "effect of intervention" use, which is, obviously, of the utmost relevance in Economics and Finance. Our choice is not due to the relevance of the first use with respect to the second, but to the fact that the "effect of intervention" use would require, for a thorough exposition, a course in itself⁴⁰.

Let us try to give some simple, if approximate, characterization of these two uses of regression⁴¹.

First use: forecasting. Given some information, that is: observed values of variables, we would like to forecast in a sensible way the values of other variables. This is obviously very useful for our decision making: to have an idea about future weather on the basis of observed meteorological data could be of some help in deciding where to spend our holidays; to have an idea about the possible returns in a population of stocks, on the basis

Footnote 40: This distinction and the debate between the two problems is at the core of the birth of Econometrics between 1930 and 1950.
While many authors discussed the topic at the time, it is possible to single out the central role of the 1943 paper by Trygve Haavelmo: “The Statistical Implications of a System of Simultaneous Equations”, Econometrics, 11, 1, (Jan. 1943), pp. 1-12. This paper contains, in maybe sometimes very summarized but quite clear version, most of the arguments that would have been discussed during the following 75 years and of which a simple version is presented in these pages. A study of this paper, while not required for the course, would be very useful for any Student wishing for a more deep understanding of the role of statistical inference in Economics and Finance. Another important reference in the same time frame is the Cowles Commission series of monograph and, in particular, the 10th of these: “Statistical Inference in Dynamic Economic Models”, Wiley, (1950). 41 On this topics, for a view substantially similar to the one introduced here, see, e.g. David Freedman (1997) “From association to causation via regression”, Advances in Applied Mathematics, 18, 59-110. In this paper Freedman also consider a use of regression only hinted at in what follows: regression as summary of data. 40 110 of observed characteristics of these, could be of some help in deciding our portfolio composition; to have some idea about the probability of default of some company on the basis, say, of its balance sheet, could be of some help in deciding whether to invest in it. In each of these cases we may use observations of available data in order to compute forecasts (maybe expected values) of data we still do not know and take decisions. In all these examples, we want to forecast events which, at the moment, are unknown to us, on the basis of available information. Note: “forecasting” does not necessarily concern the future: it concerns, more in general, using available info in order to infer about still unavailable info. 
Suppose you must decide whether a suspect did commit a crime: we base our evaluation on info (clues) available today, while the event is in the past. In assessing the origin of a medical condition, we shall base our analysis on available symptoms. Many relevant statistical aggregates, e.g. GNP, employment, industrial production and so on, require years to be measured in a reliable way. Routinely, national statistical services produce estimates of these aggregates which are, actually, "forecasts" based on partial and incomplete information on available economic indicators. All these cases have two aspects in common: you have some information and use this to forecast "missing information". You do this because, typically, you must act on the basis of this information. Notice that a common characteristic of the above examples is that your act is not likely to affect in any way the phenomenon you are analyzing: you get information and use it, but do not in any sense "act" in a way that could change the available information or in any way affect the phenomenon you are studying. The first point is clear cut, the second a little bit more delicate. While it is clear that your decision about where to pass your holidays is reasonably NOT going to affect the weather, and your decision about the possible cause of a medical situation is NOT going to change the situation itself, your choice of investment, for instance, could have some "effect" on the performance of a given stock or company if you are, say, a big and authoritative investor. We could think of two kinds of effects. The first is straightforward: if the investor intervenes in the market on the basis of the forecast model, a sizable market action could have an effect on the price of the stock. The second is more "strategic" and is typical of the Economics field: suppose the company knows about the forecast model and, for any reason, intends to give a "good impression" to the investor.42
This could imply an action on the part of the company in modifying some of its characteristics. This action would not have been there had the model not been used, and could degrade the forecasting abilities of the model by altering the distribution of the firm's observed characteristics wrt the one valid when the model was estimated. This line of thought, while relevant and quite interesting, would require a complex analysis well beyond the purpose of these introductory notes. Just keep in mind that the point is relevant.43 Leaving this delicate point aside, just consider this first use of regression as a "forecasting" use followed, maybe, by an act, both of which (forecast and act) do not interfere with the phenomenon under analysis. In this setting, as we shall see, regression is a very good tool for forecasting (in a very precise sense of "good") and, if properly considered, its use in forecasting is, under reasonable and very clear hypotheses, easy to apply and to understand. Moreover, and this is very important from the practical point of view, we shall find simple and reliable "a priori" measures of the quality we may expect the forecast to have.

42 Suppose, for instance, that asset selection models give positive weight to, say, a lower leverage. This could induce a company to alter its leverage in a way which would be economically meaningless had the model not been in place. Another example: risk management regulations set limits on some variables related, for instance, to the overall exposure of a bank to some category of risky assets. These rules were calibrated considering, also, historical data about the connection between these exposures and risk, on the basis of a risk model. However, once the rules are in force, their simple existence could imply a strategic response on the part of the bank, altering its risk exposure in a way that could impair the ability of the model underlying the rules themselves to correctly forecast risk.
Let us now go to the second use of regression: evaluating the effect of an intervention/policy/treatment etc. Sometimes we would like not only to be able to make forecasts given information: we would like to actually influence the variables/system we forecast by "altering" the information. In other words: we are still looking for forecasts but, now, we want to forecast what could happen if we acted on some of the conditioning variables involved in the regression in such a way as to choose their values. Consider the following example: we have data on the returns for the stock of several companies and data on, say, a set of balance sheet indicators for the same companies. If we observe a correlation between returns and indicators, e.g. a negative correlation between return and leverage, we could use indicators in order to forecast returns and, maybe, invest in lower than average leverage companies. This is a forecast use and, in order to work, it only requires that the correlation is stable enough in time. If our intervention in the market is not huge, it is reasonable to assume it shall not alter the forecast itself. However, we may also wish to use the info in a different way: we may be tempted to try and influence the stock return of a given company by altering its leverage. In doing this we are no more just observers of a phenomenon: we intervene by "deciding" the value of one or more variables involved in it. Suppose we do this and "forecast the effect" of our action, based on the observation of the "non tampered" phenomenon. It is clear that our action shall have any hope of getting results in line with this forecast only if some kind of "invariance to intervention" hypothesis holds for the relations among the variables of interest, both when we observe them and when we act on them, AND if the consequences of our action on all the relevant variables used for the "forecast" are well understood. This second, very delicate, point is often subsumed in the idea that I can study the "effect" of my action on, say, a variable, as if this would not imply changes in other relevant variables: the "coeteris paribus" idea. It may well be that simple observational data, potentially useful for simple forecasting, shall be useless for this different purpose, and that we should instead create novel ways to find information (e.g. experiments, when possible, or other "intervention friendly" ways). Consider some simple instances.

42 (continued) One of the most interesting and most delicate aspects of Economics is the study of these strategic responses and how to take account of these in devising rules and policies. Notice that this is not something happening or considered only in the Economics field. One of the reasons why, in medical experiments, both treated and controls are treated in a symmetric way, such that nobody, patient or doctor, knows whether what was given is treatment or placebo, is to avoid possible "treated/not treated dependent behaviours" which could alter the result of the experiment.

43 It should be clear, for instance, that such a line of thought involves some specific definition of "causality" thru the idea of the "effect of an intervention". Any use and discussion of the word "cause" implies dire problems which have plagued science and philosophy since their inception, and this, surely, is not the place to discuss these matters in any depth. Consider, however, that the common idea (for instance in Economics and medicine) of "measure of the effect of an intervention" (raise rates, assign a medical treatment) has very little to do with the notion of "causality" in classical Physics, where the word implies no intervention or action but the forecastability of a physical system's future given its present state.
Suppose we increase the leverage of our company because we observe that, in our dataset, by regressing the stock return on a set of balance sheet indicators including leverage, we get a positive estimate for the parameter of leverage. What we observe, the source of our estimates, are data on companies where, presumably, leverage is set at an equilibrium value compatible with the other variables characterizing the company. For these companies it is true (this is what we see in the estimates) that the expected return of a company with the same values for the other variables as another company but a higher leverage is higher. Useful from the point of view of forecasting. However, we are now altering one of these variables, leverage, without consideration for the rest. This may put the company out of equilibrium, which may alter the relationship among the different variables and between these and the stock return. The result is difficult to assess and, probably, would require a much deeper analysis. In any case it could be completely different from what was expected. Now, suppose the overall relations among the variables are not disturbed by our action. This notwithstanding, our action, by itself, may alter the value of other relevant variables. The "coeteris paribus" clause would not be valid, and this is very likely in an equilibrium setting, where variables must adapt to changes in other variables in order to maintain equilibrium. If this happens, what is relevant is the "net effect" of our action, taking into account all interrelations among variables. This could be the opposite of what was intended but, and this is the important point, to state the overall result would require an analysis of the interrelationships among the different variables which goes well beyond a simple regression. Just as an example: in a regression with observed data, and as dependent variable the firm's stock returns, we may find a positive parameter both for leverage and, say, the inventory to sales ratio.
However, it may be that an (off equilibrium) increase of leverage has a negative effect on inventory to sales, and the net effect on the expected stock return could be less positive than expected, or even negative. A last point (which is, actually, a special case of the first). It may also be that the observed correlation between leverage and stock return is due to the fact that both variables are positively correlated with a third variable, observed or not. This does not alter the usefulness of the regression as a forecasting tool, but implies that any manipulation of the leverage shall be ineffective in changing expected returns. The information given by the tachometer is surely useful to evaluate the speed of our car (which we do not directly measure). However, moving the tachometer needle (old car, analogue meter) would not change the speed of the car, and, probably, would break the tachometer, so that useful info would be lost. In this case both the speed of the car and the tachometer needle depend on the rate of rotation of the car's wheels, which depends, say, on the gear and the gas, for given conditions of the road. This simple example is not so far from economic intuition. In equilibrium the level of interest rates is a measure of the convenience to invest. If a country, maybe in recession, decides to lower interest rates, maybe intervening in the market to buy its own debt, it puts interest rates out of equilibrium. It may be that the resulting decrease of rates allows for more investments; however, the new investments are such that they would not have been made in the previous situation. This could have been due to the scarce expected productivity of such investments. It is then questionable that "inducing" such investments may improve the overall economic situation. This example is simplistic but it may give an idea of the problems implied in an "intervention" analysis.
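The "third variable" point can be made concrete with a toy simulation (all numbers invented): X and Y are both driven by a common factor Z, so X forecasts Y well, but forcing X to a new value leaves Y untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z = rng.standard_normal(n)            # unobserved common driver
x = z + 0.5 * rng.standard_normal(n)  # "tachometer": reflects z, nothing else
y = z + 0.5 * rng.standard_normal(n)  # variable of interest: driven by z only

# Observational regression slope of y on x: good for forecasting
slope = np.cov(x, y)[0, 1] / x.var()
print(slope)  # about 0.8: x is quite informative about y

# "Intervention": push every x up by 1 while z is untouched;
# y is generated from z alone, so its distribution does not move
x_new = x + 1.0
y_new = z + 0.5 * rng.standard_normal(n)
print(y_new.mean() - y.mean())  # close to 0, not to 0.8
```

The regression is a perfectly good forecasting tool on the untampered data, yet the "effect" of manipulating x is nil: exactly the leverage/return situation described above when a common cause is at work.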
Many other examples are possible and we shall go back to these points in what follows. What is to be kept in mind is that both purposes, simple forecast and "intervention effect measure", are interesting and practically important but, while the requirements for the forecast use of a regression to be useful are quite simple, clear cut and, in summary, limited to the invariance of the distribution of observables, this is not true or simple when we are interested in forecasting the effects of an intervention. In what follows we shall give some basic, but rather complete, hints toward a correct reading of regressions as forecast tools, and only hint at the "intervention" use, with the purpose of giving some criterion to distinguish between the two. This is an introductory course and we cannot even attempt a summary of what is needed for a correct "action/intervention" analysis. For this reason, our sole purpose in this regard is to make as clear as possible that forecasting and intervention analysis are different and partially non connected purposes. This has been a well understood fact since the beginning of the study of Econometrics and, in fact, most of what distinguishes Econometrics from other fields of application of statistical inference can be traced to different attempts at dealing with these two purposes. To understand this is of paramount importance, in order to avoid PRACTICAL errors with DIRE PRACTICAL consequences. The plan of this section is as follows. We begin with a simple analysis of the basic properties of a regression function. We follow this by trying to characterize, in some more detail, the difference between a forecast and an intervention. We conclude by presenting a simple procedure for analyzing a regression (and in particular a linear regression) as a forecast tool.

9.12.2 Some properties of the regression function

(* Only the definitions and properties are required for the exam, not the proofs of the properties.
These are, however, a useful exercise.)

In what follows we shall always consider linear models as models of linear regressions. In other words, we shall suppose that a "set" of random variables (in general a matrix) Z (maybe only partially observable) has a probability distribution P(Z); that we want to study the conditional expected value (aka regression function) of one vector of Z, which we indicate with Y, conditional on a matrix of other elements of Z, which we call X (so that, obviously, we suppose such conditional expectation to exist); and that we suppose this conditional expectation to be expressed by E(Y|X) = Xβ, where β is a vector of non random parameters. It should be obvious that, out of a given Z, we could compute and be interested in many conditional expectations which we could derive from P(Z), and that, as already stated, we could be in the condition of not having complete knowledge of the elements of Z. From the point of view of forecasting, the choice shall be based on what we know and what we wish to forecast in Z. Just to stress the different objective: in the case of an intervention analysis this quite arbitrary choice would be totally unwarranted. While it is not strictly necessary, just for simplicity we shall suppose, when necessary for simplifying results, all "interesting" regression functions to share the property of being linear. Let us recall some properties of regression functions in this context (remember: in what follows we avoid mentioning mathematical details which could be quite difficult from the technical point of view, while almost irrelevant from the applied point of view. Hence, our results shall be "bona fide" versions of more complex and sometimes cryptic results). Most of these properties are already recalled in Appendix 19, but it is useful to recall them here and comment on them a little more in depth. We begin with general properties which only require the regression function to exist, be it linear or not.

1. Suppose Y = g(X). That is: Y is a function of X. Then E(Y|X) = g(X).

2. E(E(Y|X)) = E(Y). Here you must understand the meaning of the two E signs. Each expected value is taken wrt a different distribution (but both distributions derive from the same P(Z), otherwise the result does not hold). The "outside" E is wrt the distribution of X, that is P(X); the "inner" E is wrt the conditional distribution of Y GIVEN X. In general E(Y|X) is a function of X, and for this reason you take the outside expectation wrt the distribution of X. This function is what is called the "regression function" (of Y on X), and in our vector case this "function" is a "vector function" (or, better, a vector of functions). In the case where the regression function is a constant (that is: it has the same value conditional on different values of X), due to the above result this constant must be E(Y). In this case we say that Y is regressively independent of X. Now let us reconsider the outside E. We said this expectation is wrt P(Z), but now we understand that what is inside, the regression function, is a function of X only. To take the outside expected value, then, we only need P(X), which we could derive from P(Z) with the usual trick of "integrating out" or "summing out" (continuous and discrete case) the unnecessary variables.

3. E(Y − E(Y|X)) = 0. This is a corollary of the second property and of the interpretation of the E. Here you see, and this is very useful for a good understanding, that the interpretation of the outside E as wrt P(Z) is relevant. In fact, the first obvious step in proving the result is: E(Y − E(Y|X)) = E(Y) − E(E(Y|X)), and then, since the second property is that E(E(Y|X)) = E(Y), we have the required result. This is correct, but it implies that, when we write E(Y − E(Y|X)), the outside E is wrt P(Z): otherwise, if you could only compute E wrt P(X), it would be impossible for you to compute E(Y).
You can forget this and use the simple additive rule of the expected value symbol plus the first property, but if you really want to understand what you are doing it is better that you consider the point at least once.

4. E((Y − E(Y|X))E(Y|X)') = 0 (notice the ' sign and recall that we are here considering vectors and matrices). Let us do it step by step. E((Y − E(Y|X))E(Y|X)') = E(Y E(Y|X)') − E(E(Y|X)E(Y|X)'). Up to this point nothing new. Now concentrate on the first term of the difference and use the first property "in reverse": E(Y E(Y|X)') = E(E(Y E(Y|X)'|X)). This seems to make things more complex, but we should recall that E(Y|X) is a function of X, so that (property one) E(E(Y|X)|X) = E(Y|X). Use this and get: E(E(Y E(Y|X)'|X)) = E(E(Y|X)E(Y|X)'), where E(Y|X)' is taken out of the inner expectation conditional on X because it is a function of X. With this understanding we have the required result: if you compute the covariance matrix of the "forecast errors" Y − E(Y|X) and of the "forecasts" E(Y|X) you get 0. You should be able to understand why we are calling E((Y − E(Y|X))E(Y|X)') a "covariance matrix", and you should realize that here we are using the terms "forecasts" and "forecast errors".

Now we are ready to state and prove a very simple result of paramount relevance. This result justifies the use of the regression function in order to "forecast" Y on the basis of X. Consider any (vector) function h(X) (the vector has the same dimension as Y). Suppose you want to "forecast" Y using h(X) and you want this forecast to be "the best possible". The regression function is a possible candidate, as it IS a function of X (call E(Y|X) = φ(X)), but there exist, in general, infinite other possibilities. You measure the "size" of the "expected forecast error" with its mean square error matrix:44 E((Y − h(X))(Y − h(X))'). We would like this to be as "small" as possible: this seems a sensible idea.
However, this is a matrix, so that we must define "small" in a non trivial way. Here we follow the Gauss-Markoff definition and we look for a choice h(X) = h∗(X) such that any other choice would yield an MSE matrix whose difference from the one corresponding to h∗(X) is (at least) PSD, that is:

E((Y − h(X))(Y − h(X))') = E((Y − h∗(X))(Y − h∗(X))') + H

where H is PSD. We are about to show that the "best" way of choosing h(X), so as to make E((Y − h(X))(Y − h(X))') the "smallest" in this sense, is to choose h(X) = φ(X). Begin with the identity:

E((Y − h(X))(Y − h(X))') = E((Y − φ(X) + φ(X) − h(X))(Y − φ(X) + φ(X) − h(X))') =

= E((Y − φ(X))(Y − φ(X))') + E((φ(X) − h(X))(φ(X) − h(X))') +

+ E((Y − φ(X))(φ(X) − h(X))') + E((φ(X) − h(X))(Y − φ(X))')

We are now going to show that the last two terms of this sum are both equal to 0. They are one the transpose of the other, hence we can just prove the result for either of the two. Take for instance the first one and use the first property, additivity and the second property; you get:

E((Y − φ(X))(φ(X) − h(X))') = E(E((Y − φ(X))(φ(X) − h(X))'|X)) =

= E(E(Y|X)(φ(X) − h(X))' − φ(X)(φ(X) − h(X))')

Now: since E(Y|X) = φ(X), inside the expected value we have the difference of two identical matrices, that is: a matrix of zeroes. In the end we have

E((Y − h(X))(Y − h(X))') = E((Y − φ(X))(Y − φ(X))') + E((φ(X) − h(X))(φ(X) − h(X))')

The first term of the sum does not depend on your choice of h(X). Consider the second. If we recall the definition of a positive semi definite matrix, we see that (φ(X) − h(X))(φ(X) − h(X))' is PSD. This implies that E((φ(X) − h(X))(φ(X) − h(X))') is also PSD.45 The "best" you can do, then, is to set this term to 0 by choosing h(X) = φ(X). Any other choice gives you a bigger, in the Gauss-Markoff sense, mean square error matrix.

44 In general this is not the variance covariance matrix, as we are not requiring E(h(X)) = E(Y).
Summary of the result: the regression function is the "best" (in this particular sense) function of X if your aim is to forecast Y.

Note: if, instead of E((Y − h(X))(Y − h(X))'), we decide to minimize E((Y − h(X))'(Y − h(X))), that is: the expected sum of squared errors of forecast, a proof following the same steps as the above proof, just changing the position of the transpose sign, shows that the regression function minimizes the expected sum of squared errors of forecast. In this case the objective function is a scalar, so the term "minimizes" has its usual sense.

All these results do not require the regression function to be linear. We need linearity in order to state and prove the following important property. Suppose X1 is a subset of the columns of X and suppose (linearity) that E(Y|X) = Xβ and E(X|X1) = X1 G, where we use an uppercase letter (G) because X is a matrix, so that E(X|X1) is a matrix of regression functions. We then have E(Y|X1) = E(E(Y|X)|X1) = E(Xβ|X1) = X1 Gβ. As stated above, given a choice of Y many regressions are possible if we condition Y on different sets of X. However, there shall be a connection between these regressions. This simple result allows us (for a linear regression) to compute the coefficients of the regression of Y on X1 when we know the coefficients of the regression of Y on X and X1 is a subset of X. A more general version of this result, the partial regression theorem, shall be discussed in what follows.

45 This is because the expected value operator has the "internality property", that is min(z) ≤ E(z) ≤ max(z) for any random variable z. Since the matrix (φ(X) − h(X))(φ(X) − h(X))' is of the kind PP', it is a PSD matrix (whatever be the set of values X takes). This means that, whatever the vector of numbers v, we have v'((φ(X) − h(X))(φ(X) − h(X))')v ≥ 0.
So, by the internality property, E(v'((φ(X) − h(X))(φ(X) − h(X))')v) ≥ 0 whatever be v; but v, while arbitrary, is non stochastic, hence we can take it out of the expected value operator and we get v'E((φ(X) − h(X))(φ(X) − h(X))')v ≥ 0 whatever be v. This is what we wanted to prove. Notice that by saying PSD we are not excluding that the matrix be PD: the point is that we only need to prove it is at least PSD.

9.12.3 Forecasting versus intervention

We now know something more about the regression function. In particular we know that regression is, in an interesting sense, an "optimal" choice if we want to forecast Y when we know X. It is, therefore, not surprising that, when we are concerned with such a forecasting purpose, we liberally use it. To forecast, in this context, means simply this: suppose you observe the variables X on some subject/individual; you are interested in forecasting the value of another variable Y, and this interest depends, as a rule, on the fact that the value of Y is unknown. If you want the forecast to minimize (in the sense considered above) a statistical measure of the error such as E((Y − h(X))(Y − h(X))'), you are going to use as forecast E(Y|X); that is, if you know the functional form of E(Y|X), you simply input X into it and get as output the conditional expectation/forecast. You do not act in any way to choose, alter or set the value of X, and any decision you take as a consequence of your forecast is supposed not to alter the regression function. It is to be stressed that it is not necessarily the case that Y follows X in time. A forecast is useful any time X IS KNOWN BEFORE (not: "happens before") Y. For instance: X could be some statistical series easy to measure or observe, while Y may be much more difficult and costly to observe, even if contemporaneous to X or even if it "happened" before X. In this case the use of E(Y|X) as a forecast of Y given the info on X is very useful in order to save time and money.
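A small simulation may help fix ideas on the properties of the previous subsection. We assume an invented toy model Y = X² + ε, so the regression function φ(X) = X² is known exactly, and check the zero-mean forecast error (properties 2-3), the orthogonality of error and forecast (property 4), and the MSE optimality against an arbitrary competitor function of X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x = rng.standard_normal(n)
y = x**2 + rng.standard_normal(n)   # true regression function: E(Y|X) = X^2

phi = x**2                          # the regression function
err = y - phi                       # forecast error

# Properties 2-3: E(E(Y|X)) = E(Y), so the forecast error has mean zero
print(err.mean())                   # close to 0

# Property 4: forecast error uncorrelated with the forecast
print(np.mean(err * phi))           # close to 0

# Optimality: any other function of X forecasts worse in mean square
mse_phi = np.mean((y - phi) ** 2)
mse_lin = np.mean((y - (1.0 + 1.0 * x)) ** 2)   # an arbitrary competitor
print(mse_phi < mse_lin)            # True
```

Any competitor h(X) could be substituted for the linear one used here; by the result above, none can beat φ(X) beyond sampling noise.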
For instance: the GDP variable in national accounts is, as a rule, first computed using a set of easy-to-observe proxies, and then updated and made more precise over time (years) as more detailed info becomes available. In order for X to be useful to forecast Y, what is required is that E(Y|X) be a non constant function of X.46 While it could be of interest to understand why this may be the case, it is important to understand that, to a first order of approximation, if all we are interested in are forecasts, it is not so important to answer this question: we can just use this fact. A forecast is only about information; it is about using what we know in order to say something about what we do not know. Sometimes, but this is just a particular case, the "information linkage" between variables may have something to do with some causal (in any intuitive sense) connection between the variables: "if I know the cause I should know something about the effect". However, it is also true that if I know the "effect" I can say something about the "cause" (just think about how an MD or a police detective works, deriving from symptoms/clues hypotheses on the medical condition/culprit). In conditional expectation terms, forecasting has nothing to do with the "direction" of such a possible, but not necessary, causal connection: we can try and forecast the "effect" given the "cause" or the "cause" given the "effect". The point is: what do we know, and what do we want to forecast. Let us go back to the example concerning the speed of a car.

46 "Useful" here means V(Y) > V(Y|X), that is: the knowledge of X "reduces the uncertainty" on Y. Recall the identity V(Y) = E(V(Y|X)) + V(E(Y|X)); since both rhs terms are non negative, in order for V(Y) > V(Y|X) it is necessary that V(E(Y|X)) > 0, that is: E(Y|X) must be a non constant function of X.
Let us say that the "true speed" of your car is the one measured by a roadside Doppler radar, while what we can observe is the car's tachometer. If we suppose that both tools are well calibrated, and the conditions of the road reasonable, we expect both tools to give similar values for "the speed of the car". You can then use either of the two in order to "forecast" the other, and the forecast should be quite good (meaning: high R2). You choose which forecast to use on the basis of the available info. As the driver of the car, you may be interested in the forecast you can make using your tachometer in order to avoid breaking speed limits. It is also clear that there is no "causal" connection between the two measures, at least not in the sense that, by altering the reading of one of the two instruments, you can alter the other. If, for instance, you break the plastic on the instrument panel of the car and stop the tachometer needle with your finger (we suppose analogue dials), this is NOT going to alter the radar measurement, and vice versa. In this case you have very good forecasts, provided you do not mess with the instruments. You can make forecasts conditioning both ways, according to what you know. But such forecasts do not imply any causal connection, at least not in the sense that you could alter one measure by controlling the other. Since we know what is happening (an economist would say: "we know the structure of the economy"), we know the reason for this. The two dials measure correlated phenomena. The radar measures the speed of the car wrt the radar itself; the tachometer measures the rolling rate of the tire. If the car is running on a reasonably non skidding surface (not on ice), the two should be highly correlated, hence our forecast ability. It is interesting to notice that, if we are only interested in forecasting, we may do without such understanding and only suppose the informative relationship is stable.
In principle, under stability, we may forecast even if we do not "understand", in the sense that "we have no idea whence the correlation comes". This informal idea of "stability" has several names in Statistics. In a very simple and constrained sense, it is called stationarity. The idea of i.i.d. random variables is a particular case of this. More in general the relevant idea is that of "ergodicity", which is quite beyond what we do in this course. We can go further: it is clear, in an intuitive sense, that the rolling rate of the tire "causes" the tachometer measure (even on skidding surfaces) and the Doppler radar measure (on non skidding surfaces), in the sense that if I alter the rate, maybe acting on the gas pedal or on the brake pedal, I expect the needle of the tachometer and of the radar to move in a precise direction. It is also clear that this is not true in reverse: I cannot speed up by moving, with my finger or other tools, either of the dials. This notwithstanding, we are using the "effect" (the position of the dial) in order to "forecast" the "value" of the "cause" (the rolling speed of the tire). This is perfectly sensible and is going to work, obviously, if I do not tamper with the dial. As mentioned in the introduction of this section, to use the "effect" in order to "forecast" the "cause" is a quite common procedure. Consider a case where the "cause" is in the past while the "effect" (as usual) is in the future. The information we have about extinct living beings comes from their fossilized remains, which are available today. We can say something about the shape and behaviour of extinct living beings by "conditioning" on the information we can derive from what today is a fossil. However, in no sensible meaning do fossils "cause" the existence in the past of now extinct living beings. I may destroy all fossils today; this would be very foolish, but would not alter the past of life on our planet.
Maybe it would alter our understanding of this past, and this could be the objective of the (reasonably mad) fundamentalist/paleo-terrorist involved in the destruction. This, however, is another story. Once we understand that a forecast has only to do with an information linkage between variables, underlying which there may or may not be a causal relationship, we may consider a second point and try to shed more light on the difference between simply forecasting and attempting to intervene on the result of a phenomenon. When you compute E(Y |X) for a given subject for which you observe X, a set of variables, you simply put the observed X in the function φ(X) = E(Y |X) and get your forecast. If X is made of many different measures, for this purpose there is not much interest in measuring the “contribution to the forecast” of each variable in X. To make things simple, suppose Y is a single variable and X is made of X1 and X2, and suppose E(Y |X1 , X2 ) = α + β1 X1 + β2 X2 where, for instance, α = 0, β1 = 1 and β2 = −1. This simply means that, if in a subject you observe X1 = .4 and X2 = 1, your forecast for Y in that unit is −.6, and if you observe, in another unit, X1 = .4 and X2 = 2, your forecast is −1.6. You can surely say that, if you consider two different subjects with the same, say, X1 and two different values of X2 where the difference in the two values is 1, the two forecasts shall differ by β2, that is by −1. But this cannot be read in the sense that, if in a unit you “change” the value of X2, increasing it by 1, then you are going to get a forecast smaller by 1 than before. This is wrong for many reasons. Consider the tachometer example above: let us say that to change X2 means to alter the position of the dial with your finger. It would be foolish, in this case, to use as forecast function the conditional expectation computed by observing data where the dial is not tampered with.
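The arithmetic of the two-subject comparison above can be sketched in a few lines of code. This is only an illustration: the function name forecast is hypothetical, and the numbers are the made-up values from the example.

```python
# Forecast function from the example: E(Y|X1,X2) = alpha + b1*X1 + b2*X2,
# with the illustrative values alpha = 0, b1 = 1, b2 = -1.
def forecast(x1, x2, alpha=0.0, b1=1.0, b2=-1.0):
    return alpha + b1 * x1 + b2 * x2

f1 = forecast(0.4, 1.0)  # first subject:  -0.6
f2 = forecast(0.4, 2.0)  # second subject: -1.6

# Across two different subjects whose X2 values differ by 1, the forecasts
# differ by b2 = -1. This says nothing about what would happen if we
# intervened on X2 within a single subject.
```

Note that the difference f2 − f1 compares two distinct units; it is a statement about the forecast function, not about the result of an action.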
If I actually did the experiment of comparing the speed as measured by the radar with that of the non tampered and the tampered tachometer, I would see that, while a change in the untampered tachometer dial corresponds (with some approximation) to a change in speed as measured by the radar, this does not happen if the dial position is altered because it was tampered with. This is totally obvious but contains a very important teaching: there are ways in which, if I “change” some variable in the system I observe, the informative role of such altered variable changes wrt the role it has in the untampered system. For this reason, by itself, the observation of the untampered system, while useful for forecasting, cannot tell me anything, in principle, about the “effect” of me tampering with the system. If our notion of “cause” is based, as it usually is in Economics and Finance, on the idea of “intervention”, we can simply say that a forecast made on the basis of information (be it made using regression functions or other tools) in general tells us nothing about any causal relationship. This is even more evident when, as in the case of fossils, we observe X a long time after Y happened. The observation of a fossil fish may be an indication that, in the past, a sea existed where the fossil was found, and the observation of a fossil sloth would imply that, in the past, some forest/savanna environment existed where the fossil is found; but if I swap the fossils I cannot expect to alter the past environments. Again, this is obvious and, being obvious, it should always be in the mind of any researcher using regression. In this section we saw that a regression is always (under some stability hypothesis) useful for “forecasting” and that “forecasting” does not necessarily have to do with “cause” but only with information. What we are going to discuss in the following sections is a simple but complete way to read regressions as forecasts.
If, before doing this, you wish for some more hints about the business of intervention, read the following section (not required for the exam). 9.12.4 Some quick points about intervention (not for the exam) While we are not considering, here, measures of causal effects, it could be useful to dedicate some lines to this topic in order to at least understand how much more difficult this kind of job is, even in the case we limit ourselves to a definition of “cause” based on the idea of “intervention”: setting an arbitrary value to a variable in a system. This idea of “measuring the effect of an intervention”, with its implied optionality, owes much to fields like medicine, agriculture and biology, based on statistical experiments and randomization. Notice that this has, for instance, nothing to do with the concept of “causal” in classical 19th century Physics. In that setting “causality” had to do with the fact that, in principle (at least this was the belief at the time), once the initial conditions of a system of particles were known (positions and momenta), the future evolution of the system was fully determined by these and the equations of motion. In this setting an “intervention” is simply the perfectly determined result of an inevitable interaction between two perfectly determined systems (in other words: there is no possible “choice”). Let us reprise, in a more detailed way, the stylized example regarding interest rates. It could be, and it is, of interest to study the connection between the level of interest rates and, say, economic growth. However, we are also very much interested in the “effects” of monetary “policy”. We are interested, for instance, in the possibility of controlling growth by “manipulating” rates. Are these two purposes (forecast and study of the effects of policy intervention) equivalent? This is by no means evident. Suppose the economy is left to “work by itself” and we do not intervene.
Interest rates and growth are jointly determined, and neoclassical Economics supposes this connection to satisfy some equilibrium condition (various exist) involving many other observable and non observable quantities, in principle all those quantities that are involved in individual choices which, in their turn, are modelled as those of optimizing individuals possessing some utility ranking for the different possible results and aiming at the best possible result according to this ranking. The way such an equilibrium is reached is, as a rule, not discussed, but an axiom of the theory is that “nothing happens if not in equilibrium”. The evolution we observe in the economic system is a continuous change from one equilibrium position to another equilibrium position, induced by external factors (resources, technology) and, possibly, by the evolution of preferences. This implies that, when we observe changes in variables, these changes always satisfy the equilibrium axiom. In our example, we may understand this “equilibrium” as described by the joint probability distribution of interest rates and growth (and, arguably, many more variables). We may suppose this distribution to be known. Or maybe, under some condition, we may estimate it from data. From our knowledge, or from observation, we may, for instance, infer a negative correlation between growth and interest rates. The meaning of this, if we intend just a marginal correlation with no further conditioning, is that we did, or should, observe that, on average, while the economic system transits between different equilibria, higher rates are accompanied by lower growth and vice versa. This is equivalent to saying that we can forecast higher growth when we know that interest rates are lower, and higher rates when we know growth is lower. Which of the two forecasts (rates with growth, growth with rates) we choose to make depends on our purposes and on what we know.
In any case the forecasts shall “work” if the correlation is stable and “work well” if it is high. Does this imply that I can, say, increase growth by setting low interest rates? In general no or, better, we can say nothing on this point on the basis of the sole knowledge of observed correlations. The question itself is puzzling: if everything is determined in equilibrium, how can I choose the value of any variable? Am I, in some sense, outside the system, as, say, is Nature (who chooses, say, crop yield and so determines an evolution of the economy through different equilibria), or am I in the system but, in some sense, able to set some variable to a value that is not the equilibrium one and expect the system to reach a new equilibrium with this added constraint? (There is and was much discussion about this point). Whatever the interpretation of the question, the reason why an answer is impossible on the sole basis of the observed (equilibrium) correlations is quite obvious. The info I used to make forecasts came from the free working of the “economy”. If I intervene and change rates, whatever the interpretation of my intervention, I am tampering with this autonomous working and it may be that I substantially alter it. In principle I should change the very name of the “interest rate” variable when this is set arbitrarily by me. It used to be the “free equilibrium interest rate” variable; now it is the “intervention interest rate”. This change of name, by itself, could be useful in that it clarifies what is non obvious in the use of, say, the regression function for growth conditional on the free equilibrium interest rate in order to evaluate the “effect” of setting a particular value of the “intervention interest rate”. Are we doing something like pushing the gas pedal to increase speed, or are we trying to increase the speed by acting on the tachometer dial? Or maybe are we even changing the structure of the car as a system, perhaps altering it to a point where it breaks down?
What is important to understand is that we cannot decide this on the basis of the simple observation of the “natural” correlation of interest rates and growth, where no “action” on interest rates is under way. This “natural correlation” is very useful if we try to forecast one of the two variables using the other under “natural conditions”. It could be completely useless both for causal effect measurement AND for forecasting if we intervene on one variable. It may even be the case that we simply cannot intervene, even in a perfect causal setting: there is correlation between temperature and the position of the sun wrt the horizon, but we cannot alter this position in order to have an effect on temperature. There is correlation between age and income; this allows us, e.g., to forecast income for any given age and, as a consequence, to estimate the possible path of income for a given individual as time passes. But we cannot alter or intervene on age. And so on. Let us go back to “natural correlation”. A way to understand this is as follows. Suppose Z is the set of “relevant” economic variables (observable or not). Let P (Z) be a valid probabilistic description of the “economy left to itself”. Any intervention could alter it, in the sense that the probabilistic description of Z when one or more variables are acted upon could be a different joint probability, say A(Z). The difference could be of many kinds, some less and some more relevant. One very relevant difference would be the case where the regression function under intervention is different from the “forecasting” regression function, so that observations coming from “the world of P (Z)” could be of very little use to forecast some Y when we intervene on some X. Notice that this could also be true in reverse: if your data come from intervention, it may be that you cannot use them for estimating the forecasting regression function. There is another important problem.
Suppose that P (Z) is not altered by the intervention or, at least, that the intervention does not change the regression function, that is E(Y |Xintervention ) = E(Y |X), and suppose that X is not only informative on Y but is also a “cause” of Y (this is somewhat implicit in the assumption E(Y |Xintervention ) = E(Y |X)). Recalling properties 2 and 4 of the regression function, the errors of forecasts have expected value 0 and are uncorrelated with the forecasts if the forecasts are made using the regression function; in particular, property 2 implies that forecasts made with regressions are “unbiased”. Now: we are here considering E(Y |X) but, obviously, there may be many other observable or unobservable variables, say Z, useful to forecast Y . In general, for a given value of X, E(Y |X, Z) depends on Z and so is NOT equal to E(Y |X). Now suppose that your choice of action, Xintervention, depends on some of these Z. In Economics, for instance, it is rarely the case that you choose policy by random assignment: you intervene because there is some problem (values of some Z you do not like); in medicine you do not treat a patient randomly, you intervene because of symptoms. If this happens, however, it may be the case that, while E(Y − E(Y |X)) = 0 (property 2) and E((Y − E(Y |X))E(Y |X)′ ) = 0 (property 4), it is not true that E(Y − E(Y |Xintervention )) = 0 and E((Y − E(Y |Xintervention ))E(Y |Xintervention )′ ) = 0, and this is due to the concomitant “action” of Z. In other words: even if we are in the best possible setting (there exists a causal relationship, not just an informative relationship, and the action does not change it) we would still be unable to quantify the policy outcome by the sole knowledge of E(Y |X).
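Properties 2 and 4 of the regression function can be verified directly on a toy discrete joint distribution. The numbers below are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical joint distribution of (X, Y): a list of (x, y, probability).
joint = [(0, 1, 0.25), (0, 3, 0.25), (1, 2, 0.25), (1, 6, 0.25)]

# Regression function m(x) = E(Y | X = x).
p_x = defaultdict(float)
sum_py = defaultdict(float)
for x, y, p in joint:
    p_x[x] += p
    sum_py[x] += p * y
m = {x: sum_py[x] / p_x[x] for x in p_x}  # here m(0) = 2, m(1) = 4

# Property 2: forecast errors have expected value 0.
mean_err = sum(p * (y - m[x]) for x, y, p in joint)

# Property 4: forecast errors are uncorrelated with the forecast.
err_dot_forecast = sum(p * (y - m[x]) * m[x] for x, y, p in joint)
```

Both sums come out to 0, as the properties require; with an intervention whose choice depends on further variables Z, the analogous sums computed with the untampered regression function need not vanish.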
This implies a word of warning about the following, otherwise reasonable, suggestion: instead of using as data all available observations on interest rates and growth, just limit your sample to those data points where rates were in fact determined by policy actions. This would in fact be reasonable if we thought that such policy acts were not themselves at least partially induced by variables, observed or not, which by themselves have an influence on growth. If this is not the case, what we do observe would not be connected with the “effect” of policy only, but also with the effect of such (maybe unobservable) variables. The result could then be of little help, as we cannot be sure, if we do not model the relationship, about which would be the value of such variables for the policy action under consideration47. Consider the following example in a different field. In medical experiments the standard procedure to test a new treatment is to select a population of subjects with a given medical problem the treatment should help with, say: high blood pressure. The population is randomly divided in two parts; to one of these the treatment is assigned while to the other a placebo is given. Neither the patient nor the doctor knows whether placebo or treatment was actually given. Why? Because the knowledge of being treated or not could alter the behaviour of the patient in such a way as to “confound” the effect of the treatment. For instance: a patient, knowing that all he got was a placebo, could be induced to stay on a diet or to follow other therapies. At the opposite extreme, a treated patient, knowing of having been treated, could decide to spend his time eating hamburgers and fries. It is clear that, in this case, the result of the experiment would not allow for a measurement of the effect of the treatment but, maybe, of the effect of the full procedure of being in an experiment, being given a treatment and knowing about it.
Would this be useful for assessing whether or not to use the treatment as an approved cure for high blood pressure? Now, suppose the experiment is performed in such a way that the patient does not know whether the treatment or the placebo was used. In this way you could measure the specific effect of the treatment. Suppose, now, you want to translate this result into the effect of the treatment in real world conditions. In real world conditions, doctors do NOT randomly treat patients and patients know about the treatment. Patients are treated because they feel ill, go to the doctor, and the doctor gives the treatment. Is the “overall effect” of the treatment still going to be comparable with the experimental result? As you can see, lots of interesting problems and questions exist. In fact, in the last century lots of research has been done on these topics and we know today many results and procedures useful, case by case, to try and deal with them. The analysis of these problems has been at the core of the “statistical experiments” literature in medicine, biology and similar sciences since the beginning of the 20th century, and of the “Econometrics” movement in the Economics field during the nineteen twenties. Solutions exist but, in general, these require either observations coming from some controlled version of the phenomenon, or very strong, untestable, assumptions on the data, or a mix of both. In fields like agriculture, medicine, biology and similar, researchers try and study A(Z) by actually intervening on the variables in some controlled way via statistical experiments. The basic hypothesis here is that the distribution of Z under experimental conditions, say E(Z), be, at least for the relevant regressions, similar to the “in vivo intervention” distribution A(Z). 47 Some more detail on this point can be found in Appendix 21.
In these fields it is often the case that P (Z) has very little meaning, as we are interested in measuring “the effect” of variables which do not even exist “in nature” (as most medical treatments). Still, when intervention is done “in vivo”, that is, not in an experimental setting (and this is the final purpose of research), the problem is to relate the experimental result (governed by E(Z)) with the actual in vivo result (governed by A(Z)). In fields like Demography, Economics or Astrophysics, statistical experiments are either impossible (we cannot create a new universe or cancel a star; we cannot change the age of people, and while we could kill lots of them, hopefully we would not consider this an acceptable procedure) or practically irrelevant (because we cannot assume E(Z) to be similar to A(Z)). In particular, in the case mentioned above of interest rates and growth, we could, at least in principle, try and mock an experimental setting by, say, randomly assigning interest rates to different countries or different production units. However, apart from the cost of this, it is difficult to expect that rational units would react to this “strange mess” in any sense similar to what they would do under actual changes of monetary policy. Moreover, since monetary policy induced interest rate changes are made for specific purposes in specific economic conditions, it is quite unreasonable to expect that the “effect” of randomly changed interest rates be in any way similar to the effect of interest rates changed with specific purposes and under specific conditions. In general in Economics we shall most frequently use data governed by P (Z) (observational data) together with hypotheses on how different A(Z), or at least the relevant regressions under intervention, could be. (This is called “structural modeling”). In this course we shall not consider this point which is, as we did try and show, obviously of the utmost importance.
However, in discussing the examples at the end of this section we shall have some opportunity of delving a little further into this point. In this section we tried to clarify why, while a “forecast” reading of a regression is always reasonable, a view based on “forecasting the effect of an action” or, simply, a “causal” reading is by no means reasonable in general, or even possible without recourse to a much more complex analysis and a richer set of a priori assumptions. For this reason here we are only concerned with how to read a (linear) regression in a forecasting setting. This shall be in any case useful and, in particular, shall also be useful when the regression happens to have a causal interpretation. We shall not deal further with how and under which conditions such an interpretation can be upheld. 9.12.5 Reading a probability model: a two level procedure Let us begin with a general principle. A good reading of a linear model, as of any statistical model, should begin by splitting the procedure in two parts. First: understand the meaning of a regression coefficient independently of statistical inference. That is: suppose all parameters in the model are known and identical to the estimated values and learn how to read these. Here what you need to understand is the meaning of the regression model itself, without the added burden of parameter estimation. Second: evaluate what in fact you really know, since parameters are only estimates. That is: introduce a measure of sampling variability. The difficult part is the first one. The second part is easy (if we understand the meaning of sampling variability and, more in general, if we understand statistical inference). The two aspects should NEVER be mixed, and the subsidiary role of the second w.r.t. the first, at least from the point of view of interpreting results, must be understood. This is not something new or specific to the linear model setting.
Whenever we use a statistical model we must first understand the meaning of the model, without bothering about problems of parameter estimation; once we are done with this, we can try and see how the fact that some parameter in the model is unknown and must be (and hopefully can be) estimated influences our perception of the model result. A simple example: r1 and r2 are daily returns of two stocks. We assume them to be jointly Gaussian with expected values µ1 and µ2, standard deviations σ1 and σ2 and correlation ρ. We suppose we have observations on a (bivariate) time series of these returns (n data points) and that observations in this time series are i.i.d. Due to this, we can estimate the 5 unknown parameters with the “usual estimates” and get the values .001 and .002 for the expected values, .01 and .01 for the standard deviations and .2 for the correlation. It is clear that these are NOT the true values of the parameters, just estimates; they can change in other samples and we can quantify these potential changes by measuring their sampling variability. This notwithstanding, let us follow the suggestion above and see what the implications of the model are when these values are seen as the true values of the model parameters. There is no such thing as a “general” model implication: when we speak of “model implication” we must have in mind a specific practical problem. Let us say that our problem is to build a simple long/short position by going long one stock and short the other for the same amount. We must choose which stock to go long and which to short. Mind: this “one stock short against one stock long” idea is NOT sound in general; we should also find the right “hedge ratio” for the long/short position, which could be different from one to one. Moreover, our understanding of diversification effects tells us that it would be preferable to work with two portfolios, not with just two stocks.
This is not a lecture on “pair trading” (the practitioner’s name for such a position), just an example. If we simply look at the numbers, we see that we should go long the second stock and short the first: same standard deviation, but the second has double the expected return. Fine, but what can we expect from our investment? In order to answer this question let us study the random variable r2 − r1 which, albeit indirectly, describes the economic result of our investment.48 Using what we know about Gaussian random variables, we see that r2 − r1 is distributed according to a Gaussian with expected value µ2 − µ1 and variance σ1² + σ2² − 2ρσ1σ2. Taking the above estimates as true parameter values, we have an expected value of .001 and a standard deviation of .0126. With these data it is easy to compute the probability of a positive return, over one day, of our long/short position. This probability is .5315. This is better than .5 (flipping a fair coin) but, still, it does not seem so exciting, even if the expected return of r2 is double the expected return of r1. Obviously, the decision about what to do is up to the trader; however, think about the different sound of two possible descriptions of the trade. “The expected return of the long position is double the expected return of the short”. “The probability of a positive result is .5315”. As an exercise, consider the same position over different time intervals and the same position with different values of the parameters. Now the inference part. The numbers we used above are not true parameter values but estimates. As such, we must quantify their sampling variability. It is easy to compute V ar(µ̂2 − µ̂1 ) = (σ1² + σ2² − 2ρσ1σ2)/n, where n is the number of observations (days) in the sample. The variances and the correlation are unknown, in general, but for the sake of simplicity let us suppose that they are instead known and equal to the values above.
This gives us a 95% confidence interval for µ2 − µ1 equal to .001 ± 1.96 · .0126/√n. Suppose you have 1000 observations (roughly 4 years of data). In this case the extremes of the confidence interval are [.000216; .001784] and, using standard terminology, we say that the difference between the two expected returns is “significant” at the 5% level (i.e. the 95% confidence interval does not contain 0). Does this negate the above result? The answer is: no, both if you consider the position interesting and if you consider the position too risky. The “significant” result here only means that, even if the difference is small, in the sense explained above, with enough observations (n big enough) we can distinguish it from 0 even if we take into account sampling variability. This adds nothing to the properties of the position discussed above, which were analyzed supposing that the estimates actually correspond to true values. A more complete and interesting analysis of this result, which compounds the two steps, could be performed, but it is not in the scope of these handouts. What should be avoided, if ever considered, is the following reasoning: since the difference between the estimates is “statistically significant”, this implies that we should, without any other consideration, suggest taking the long/short position. And think how much advertising you could get by simply adding, as before, “and the expected return of the long position is double that of the short”. All this would not change the fact that “the probability of a positive result is .5315” (again, this computed supposing the estimates to be the true parameter values). 48 These are log returns, so we should study e^r2 − e^r1, which is the economic result, per unit, of our investment in one time period (day). Mind: being a long/short position of initial value 0, we cannot speak of the “return” of the position. We study r2 − r1 just because it is simpler and this is only an example.
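The numbers in this example are easy to reproduce. The following sketch (plain Python, using the error function for the Gaussian CDF) recomputes the probability .5315 and the confidence interval from the estimated parameters.

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard Gaussian cumulative distribution function via erf.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Estimated parameters, treated here as if they were the true values.
mu1, mu2 = 0.001, 0.002
s1, s2, rho = 0.01, 0.01, 0.2

mu_d = mu2 - mu1                                # E(r2 - r1) = 0.001
sd_d = sqrt(s1**2 + s2**2 - 2 * rho * s1 * s2)  # about 0.0126

# Probability of a positive one-day result of the long/short position.
p_pos = 1.0 - norm_cdf(-mu_d / sd_d)            # about 0.5315

# Inference step: 95% confidence interval for mu2 - mu1 with n = 1000.
n = 1000
half_width = 1.96 * sd_d / sqrt(n)
ci = (mu_d - half_width, mu_d + half_width)     # about (0.000216, 0.001784)
```

Note how the first block (probability of a positive result) uses the estimates as if they were true parameters, while the confidence interval belongs to the separate, inferential, step.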
In conclusion: inferential procedures, such as confidence intervals or significance tests, cannot render “relevant” probabilistic results that are not considered so even supposing that parameters are known. Splitting the analysis in two steps (first assume estimates are true parameter values, second discuss the precision of the estimates) can help you avoid such mistakes. As an exercise you should construct a case where, due to the small sample size, a result which could be considered practically relevant if parameters were known is put under scrutiny by the size of sampling variability. Statistical inference is a tool for estimating parameters in a probability model and assessing the amount of sampling variability. Statistics is NOT a tool for evaluating the meaning or the importance of the results we get applying the model; such a meaning always depends on the understanding of the model and of the phenomenon the model describes. What Statistics can do is just offer measures of how much we can say about the values of the parameters in the model on the basis of our sample. 9.12.6 Understanding a linear regression model as a forecasting tool. The central role of R2 A linear model of the kind Y = Xβ + ε is not always describing a (linear) regression. It does describe a regression if we assume, in some way, that E(Y |X) = Xβ. So, for instance, if we do not suppose E(ε|X) = 0, the model is still a linear model but we are NOT interpreting it as a regression. It could, then, be very interesting to analyze the nature of a linear model when it does NOT model a regression, but we shall concern ourselves only with the case where the linear model is the model for a regression. A regression is, first and above all, a conditional expectation of one random variable given other random variables. In our case we consider E(Y |X) or, better (row by row), E(yi |xi ), and in particular we consider the linear case where we suppose E(yi |xi ) = xi β.
(Here yi and xi are the i-th rows of Y and X). A regression is an optimal forecast of a variable given other variables. We now know that “optimal” here means “minimizing mean square error”. Let us recall the result: a regression is a (vector) function of xi, E(yi |xi ) = φ(xi ), such that it minimizes the “mean square error” E((yi − η(xi ))²) over all functions η(xi ). As we stressed, this is very general and does not require the regression function to be linear. Given xi (be it a single variable or a vector of variables) there is no better way, in the mean square error sense, to “forecast” yi than using the regression function. Since the purpose of a regression is to minimize the mean square error, it should be of interest to know how much such mean square error has been minimized in each specific case. We are in the setting of linear regressions; in the empirical version of a linear regression (using data and not the theoretical distribution) minimizing the MSE becomes minimizing the sum of squared residuals. This is equivalent (when the intercept is included) to maximizing the R2. This is the meaning of the variance decomposition result from which we derived the R2.49 Why is this discussion of optimal forecasts relevant for understanding the results of a linear model? Since a regression is a way to make forecasts by minimizing a measure of error, the first, and always valid, “reading” of a regression must be based first of all on summarizing “how good” this minimization was. In the context of linear regression this implies a first, simple, question: “how big” is the R2? The second question, usually, is: “with this R2, is the regression relevant?”. The term “relevant” creates many problems as, clearly, the answer shall depend on the specific context. For this reason, while researchers do propose, e.g.,
reference values for R2 under which a regression should be considered irrelevant, we suggest here a more cautious path which tries to merge an “absolute” evaluation of the R2 with a closer connection to the specific practical context of the analysis. We shall discuss this point further on with examples. However, we cannot and do not stop here. It is almost always the case that we look for some further “decomposition” of the “explained variance” in terms of each single “explanatory variable”. The third step in the analysis requires finding a variable-by-variable measure of relevance. This is a fully legitimate question, if correctly understood. It is also dangerous because, if not correctly intended, it borders on a “causal”, and possibly wrong, reading of a regression. We shall be able to correctly understand a regression if we are able to precisely set the bounds within which this question has a meaning, and if we are able to answer this question from within these bounds. This measure, too, shall be based on the R2. The Reader should notice that the value of R2 depends on the full joint distribution of Y and X. If this changes, maybe because the conditions change between the observation of different samples, or because the observation of the estimation sample is made under different conditions than that of the sample whose values we want to forecast, then the conclusions of the analysis may change (in principle, the regression itself may change). This is an important point we do not have time to fully consider in these handouts. Something more on this topic shall be discussed in section 9.12.12. In what follows we shall be careful to apply a lot of “statistical restraint”. In fact there exist many “simple” answers that would be so beautiful and direct, if they were true. 49 We should discuss the difference between the theoretical and the empirical variance, but this is not of much relevance here.
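To make the link between minimizing the sum of squared residuals and maximizing the R2 concrete, here is a minimal numerical sketch with made-up data (simple regression, one regressor plus intercept): the OLS coefficients give a smaller sum of squared residuals, hence a larger R2, than any perturbed coefficients.

```python
# Made-up data for a simple regression y = a + b*x + error.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# OLS estimates.
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

def ssr(a_, b_):
    # Sum of squared residuals for given coefficients.
    return sum((yi - (a_ + b_ * xi)) ** 2 for xi, yi in zip(x, y))

sst = sum((yi - my) ** 2 for yi in y)
r2 = 1.0 - ssr(a, b) / sst  # variance decomposition: R2 = 1 - SSR/SST

# Any perturbation of the OLS coefficients increases the SSR,
# i.e. lowers the R2 (with the intercept included, OLS maximizes R2).
worse = min(ssr(a + 0.1, b), ssr(a, b + 0.1), ssr(a - 0.1, b - 0.1))
```

Here r2 comes out close to 1 because the made-up data lie nearly on a line; the point is only the equivalence between the two readings of the same minimization.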
The problem is that wishful thinking is not, as a rule, good Statistics. The path to a correct answer begins with the understanding of the following fundamental results.

9.12.7 The partial regression theorem. Partitioning R²

There exist two versions of the partial regression theorem. They are very similar because the proof is based on the strong mathematical similarity between two completely different objects: frequencies and probabilities. We first prove the "frequency based" version, that is: the partial regression theorem valid for OLS estimates. The second version has to do with "theoretical" regression functions, that is with probability, and can be seen as a direct application of the law of iterated expectations to a linear regression. While quite obvious in terms of proof, the partial regression theorem tells us something which, maybe, is a priori unexpected: any given coefficient in a linear regression is NOT a derivative with respect to the corresponding variable, in the common sense of the term. In fact, what a coefficient in a linear regression really is, is something of much more interest, and to understand this is fundamental in order to correctly interpret the result of a regression.

Theorem 9.4. The estimate of any given linear regression coefficient βj in the model E(Y|X) = Xβ can be computed in two different ways yielding exactly the same result: 1) by regressing Y on all the columns of X; 2) by first regressing the j-th column of X on all the other columns of X, computing the residuals of this regression, and then regressing Y on these residuals.

Proof. Write the model as Y = Xj βj + X−j β−j + ε, where you isolate the j-th column of X in Xj and put the rest in X−j. To make things simple suppose the intercept is in X−j. You estimate it with OLS and get: Y = Xj β̂j + X−j β̂−j + ε̂. Now write the auxiliary regression Xj = X−j γj + uj and estimate it with OLS to get Xj = X−j γ̂j + ûj.
Substitute this in the original OLS estimated model: Y = (X−j γ̂j + ûj) β̂j + X−j β̂−j + ε̂ = ûj β̂j + X−j (γ̂j β̂j + β̂−j) + ε̂. By orthogonality of ûj with both X−j and ε̂ we get

Σi Yi ûij = Σi (ûij β̂j + Xi,−j (γ̂j β̂j + β̂−j) + ε̂i) ûij = Σi û²ij β̂j, so that β̂j = Σi Yi ûij / Σi û²ij,

which, since the mean of ûj is equal to 0 (X−j contains the intercept), is identical to the OLS estimate in a regression of Y on ûj alone.

A similar result is directly valid for the regression function (if we suppose all regressions to be linear; otherwise a similar but more general result is valid). The result is valid without considering estimates, but directly as a property of theoretical linear regression functions. In this case the statement of the theorem becomes:

Theorem 9.5. Any given coefficient βj in the linear regression E(Y|X) = Xβ is identical to the coefficient of the regression of Y on Xj − E(Xj|X−j), if we suppose E(Xj|X−j) to be linear.

Proof. The proof mimics the proof based on estimates of βj and goes as this: E(Y|X−j) = EXj|X−j(E(Y|X)) = X−j β−j + E(Xj|X−j) βj = X−j β−j + X−j γXj|X−j βj = X−j (β−j + γXj|X−j βj), and E(Y|X) = X−j β−j + Xj βj = X−j β−j + (Xj − X−j γXj|X−j) βj + X−j γXj|X−j βj, so E(E(Y|X) | Xj − X−j γXj|X−j) = E(Y | Xj − X−j γXj|X−j) = (Xj − X−j γXj|X−j) βj.

You should notice that, in this proof, regressions are required to be linear, while the proof concerning estimates only requires the estimates to come from the use of OLS in linear models. Notice, moreover, that the proof with estimates is based on the algebraic properties of OLS estimates: weak or strong OLS hypotheses are not required. In practice, the only property used in the proof is orthogonality (with intercept included), which directly comes from OLS.
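Since the OLS version of the theorem is a purely algebraic identity, it can be checked on any simulated data set. The following sketch (the simulated model and all variable names are ours, not part of the course material; only numpy is used) compares the two computation routes of Theorem 9.4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # regressors are correlated
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Route 1: OLS of Y on the full X (intercept, x1, x2).
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Route 2: regress x2 on the other columns (intercept and x1),
# keep the residuals, then regress y on those residuals alone.
X_minus = np.column_stack([np.ones(n), x1])
gamma = np.linalg.lstsq(X_minus, x2, rcond=None)[0]
u = x2 - X_minus @ gamma
beta_j_alt = (y @ u) / (u @ u)              # slope of y on the residuals

assert np.allclose(beta[2], beta_j_alt)     # the two routes agree exactly
```

The agreement is exact (up to floating point), whatever the sample: no distributional hypothesis is used, only the orthogonality of OLS residuals.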
This result, in both versions, is relevant because it immediately implies that each βj is not connected with some "relationship" between the j-th column of X and Y, but only with the relationship between Y and the part of the j-th column of X which is (linearly) regressively independent of the other columns of X. In other words: the linear regression model does not measure, in any sense, the "effect" of a given column of X. Whatever the definition of such "effect", this has only to do with the part of this column's variance which is uncorrelated with the other columns. The meaning of a regression coefficient for the same variable depends on which other variables are in the regression, and both the coefficient and its meaning change if we change the other variables in the regression. While a regression may have a causal interpretation, this is by no means necessary or even common. It is then important, when we speak of "effect", to avoid the impression of speaking in causal terms. We shall then define and measure the "effect of a variable" in a regression for what it is and for what it is implied to be by the partial regression theorem. For us, this is just the marginal "effect" or "contribution" of a column of X in reducing the mean square error or, equivalently, improving the forecast performance, when the other columns are accounted for in the sense of the above theorem. This "effect" is to be understood in "informational" terms as the ability to improve the quality of a fit and, if the inferential extension from fit to forecast is justified, in terms of quality of forecast. When the intercept is in the model, this is measured by the increase of R² you get if you add the Xj column to the model or, equivalently, by how much the R² decreases if you drop such column from the model. This quantitative measure, specific to Xj as used in a regression with a GIVEN set of other variables X−j, is called the "semi partial R²".
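A minimal simulated illustration of this dependence on the set of included variables (hypothetical names and numbers, numpy only): a variable whose true coefficient is zero in the full model can show a large coefficient when the regressor it is correlated with is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)            # x2 correlated with x1
y = 3.0 * x1 + 0.0 * x2 + rng.normal(size=n)  # x2 plays no role at all

# Regression of y on x2 alone: the slope picks up x1 through the correlation.
b_simple = np.linalg.lstsq(np.column_stack([np.ones(n), x2]),
                           y, rcond=None)[0][1]

# Add x1 to the regression: the x2 coefficient collapses towards 0.
b_multi = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]),
                          y, rcond=None)[0][2]

assert b_simple > 1.0 and abs(b_multi) < 0.2  # same variable, very different coefficient
```

Neither number is "wrong": each is the correct coefficient for its own conditioning set, which is exactly the point of the theorem.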
9.12.8 Semi Partial R²

We define this as the "marginal" change of R² due to each column in X after accounting for the other columns. This seems to imply that its computation is, if not difficult, quite long: run the regression of Y on the full X, compute the corresponding R², then drop in turn each single column in X, one at a time, and measure the corresponding change in R². This is not only long to do; it could be impossible if we are evaluating regression results as we read them in a paper or a report, as we would need the original data to perform the computations. It is quite frequent, at least in the social sciences milieu, that even the overall R² is not reported. We have a way out of this which can almost always (in a simple OLS setting) be implemented. A "folk" and simple result of OLS algebra (we give it here without proof, but see further on for a proof, not required for the exam, in a footnote) allows us to determine the marginal contribution of each column in X to the R² even if we only know the T-ratios for the single parameters and the size n of the data set.

Lemma 9.6. Suppose we are using OLS and we drop the column Xj from the matrix X. The decrease of the overall R² corresponding to the dropped column (call this R² − R²−j) is equal to t²j (1 − R²)/(n − k), where t²j is the square of the T-ratio for the dropped variable, n is the number of observations and k is the number of columns in the full regressors matrix. This quantity is called the "semi partial R²" for Xj and is nothing but the R² of the regression of Y on the residuals of the partial regression of Xj on X−j. Here the T-ratio is assumed to be computed with the standard formula we gave in the section about OLS.

An interesting point in this result is that it allows us to "recycle" a quantity which we considered as just a measure of statistical reliability into a useful way for reading the R².
This is just an algebraic result, that is: it is valid in any sample and does not require either weak or strong OLS hypotheses. It is just a numerical identity. Beyond allowing us to compute the semi partial R², this result has other interesting implications. As we stressed before, with a big sample size it is very difficult for a T-ratio not to be "statistically significant", as even a very small number can be distinguished from zero if the sampling variance is small enough (the standard deviation at the denominator of the T-ratio decreases roughly like 1/√n). However, when n − k is big, it is quite possible for the estimate to be "significant" while, at the same time, the relevance of the variable is totally negligible, as the added contribution of the variable to the explanation of the variance of Y could be negligible. Suppose you have, say, n = 10000 (not an uncommon size for a sample in the social sciences and in Finance). Suppose you have 10 columns in the X matrix and the T-ratio for a given explanatory variable is of the order of 4, so that the P-value shall be well below .01: "very" statistically significant! True, but the above lemma tells us that the contribution of this variable to the overall R² is at most (that is: even for an overall R² very near to 0) approximately 16/10000, that is: less than two thousandths. Hardly relevant from any practical point of view! If I drop the variable, the overall explanatory power of the model drops by way less than 1%. Another way to see the same point is this: how big should the T-ratio be, under the previous hypotheses, so that the marginal contribution of the regressor to the overall R² is, say, 10%? From the above formula we have that the T-ratio should be of the order of 32 (the square root of 1000 is about 31.62).
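Because Lemma 9.6 is a numerical identity, it holds exactly in any sample. The sketch below (simulated data and hypothetical names of our own; numpy only) computes the semi partial R² both directly, by running the two regressions, and through the T-ratio formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 400, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X[:, 2] += 0.5 * X[:, 1]                     # make two columns correlated
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

def r2(Xm, y):
    """Centered R-square of the OLS fit of y on Xm (intercept included in Xm)."""
    res = y - Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]
    return 1 - res @ res / ((y - y.mean()) @ (y - y.mean()))

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - k)                 # usual OLS variance estimate
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
j = 2                                        # column to drop
t2 = (beta[j] / se[j]) ** 2

R2_full = r2(X, y)
R2_minus = r2(np.delete(X, j, axis=1), y)

# Semi partial R2 via two regressions vs. via the lemma: identical.
assert np.isclose(R2_full - R2_minus, t2 * (1 - R2_full) / (n - k))
```

Note that only the classical T-ratio formula is used; with robust or clustered standard errors the identity no longer holds as stated (see the footnote further on).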
This makes even clearer the fact that, in general, "statistical significance" (that is: the estimate is precise enough that we are able to distinguish it from 0) and "relevance" (here measured by the contribution to the forecasting power of the model as measured by the R²) are very different concepts. It is now important to understand how the semi partial R² works across different columns of X. If we compute this quantity for each column of X, we measure the marginal contribution of each of these columns in "explaining" (a frequently used but not so correct term) the variance of Y. As stated above, here "marginal" means: how much the introduction of the variable improves the R² when the other variables are already all in the model. This means that, each time we compute this quantity for a different column Xj, the "other variables" left in X−j are different. For this reason, while we can split the overall R² into a part due to the introduction of a new variable and a part already "explained" by the other variables, we cannot add in any meaningful way the semi partial R² of different variables, except in the case when all the columns of X are uncorrelated (not very interesting in our field). Summary up to this point: a regression is a forecast which minimizes the mean square error (if it is linear with intercept, it minimizes the variance of the error). This is the purpose of a regression and it should be evaluated insofar as the quality of the forecast is sufficient to our purpose (the specific purpose is going to enter in the evaluation). While in a forecasting setting there is no big role for speaking of "the effect" of this or that column of X, it is possible (partial regression theorem) to define the marginal contribution of each column of X to the overall R². The quantitative measure of this contribution is given by the semi partial R².
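The non-additivity of semi partial R²'s with correlated columns can be seen on a small simulated example (hypothetical variables, numpy only): the variance "explained" by the factor shared by the two regressors enters the overall R² but is counted by neither semi partial R².

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)
x2 = z + rng.normal(size=n)                  # x1 and x2 share the factor z
y = x1 + x2 + rng.normal(size=n)

def r2(cols):
    """Centered R-square of the OLS fit of y on an intercept plus cols."""
    Xm = np.column_stack([np.ones(n)] + cols)
    res = y - Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]
    return 1 - res @ res / ((y - y.mean()) @ (y - y.mean()))

R2_full = r2([x1, x2])
sp1 = R2_full - r2([x2])                     # semi partial R2 of x1
sp2 = R2_full - r2([x1])                     # semi partial R2 of x2

# The shared part of the explained variance belongs to neither variable alone:
assert sp1 + sp2 < R2_full
```

With uncorrelated columns the inequality would become (up to sampling noise) an equality, which is the uninteresting case mentioned in the text.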
9.12.9 Further discussion about the "effect" of a variable in a forecasting setting

When we speak of "contributions to the R²" we are not speaking of the contribution of single specific values of a column of X but, obviously, we are considering the full variance of the column. Whoever reads a paper containing results of a linear model is, almost unavoidably, treated to sentences of the kind: "a change of 1" (or "a change of one standard deviation") in the variable xj "is going to have as effect, on average, a change of y of..." and here, typically, you find βj times the change in xj. As we discussed above, this is, implicitly, a "causal" or "intervention" interpretation. Strictly speaking, in the simple "regression as forecast" context, such a statement has no meaning. In a regression used as a forecasting tool, we do not attribute changes of values in the column Y to changes in this or that value of the column Xj. Our interpretation is, first, in terms of how much of the full variance of Y "can be forecasted" using our knowledge of all (columns and rows) of X and, second, in terms of the marginal contribution of each column Xj. Again, and sorry for the number of times this is repeated, but it could be useful: this is done without any causal interpretation, whether or not such an interpretation is available. So, is it possible to give any meaning, outside a causal interpretation, to statements like: "a change of 1" (or "a change of one standard deviation") in the variable xj "is going to have as effect, on average, a change of y of..."? Obviously, it may be that such a statement, trivially, is only intended to mean that, considering E(Y|X) = Xβ as a linear function of X, the elements of β can be seen as derivatives with respect to each xj. This is trivially true and trivially useless, by itself.
A derivative times the change of a variable xj measures the change of E(Y|X) only if this change in xj happens without altering any other value in X or the functional form of E(Y|X) itself, which is exactly what we cannot guarantee without further hypotheses. In forecasting terms this statement boils down to the obvious sentence: "if I have two different rows of X where the only difference between the rows is in the value of Xj, then the difference between the forecasts shall be βj times the difference in the two values of Xj". This is arithmetically true but not very useful. As already stated, a real solution of this problem has to do with a "causal" or "intervention" interpretation of a regression, that is, with the study of conditions where I can actually interpret a βj as a measure of the "effect" of a change in one variable, "keeping the rest constant", that we can actually implement. What we can do is to use the semi partial R² in order to measure the specific contribution of all the rows of Xj to the overall R², that is: the usefulness of each column of X in the forecast of Y. This is useful when our purpose is limited to forecasting and, obviously, shall also be useful when we are in the condition of performing an "intervention" analysis (which we do not consider here). From this, one must not infer that the value of β̂j is by itself not relevant in a forecasting setting. Actually, it is possible to show that such value, even in a forecasting setting, has something to do with a ratio of standard deviations in a way that echoes (but in a correct way) the (wrong) interpretation of β̂j of the kind: "how much Y changes if I change Xj by one standard deviation". Let us begin with a useful result connected to the semi partial R². Consider the regressions of Y on X−j and of Y on both X−j and Xj:

Y = X−j α + ε0

Y = X−j β−j + Xj βj + ε.

Proceeding in a similar way as in the partial regression theorem, regress Xj on X−j.
Substituting Xj = X−j γ̂ + Û in the second model you get

Y = X−j (β−j + γ̂ βj) + Û βj + ε.

Since Û and X−j are uncorrelated, the estimate of β−j + γ̂ βj coincides with α̂. Hence

Y = X−j α̂ + Û β̂j + ε̂

and therefore

Var(Y) = Var(X−j α̂) + Var(Û) β̂²j + Var(ε̂).

Notice, moreover, that Var(Û) is, by definition, equal to Var(Xj|X−j) (it is, in fact, the variance of the residual in the regression of Xj on X−j). We then have that the overall R² of the regression of Y on X can be written as:

R² = (Var(X−j α̂) + Var(Û) β̂²j) / Var(Y).

As a consequence, the increment in R² when going from the first model to the second one, that is: the semi partial R², is

R² − R²−j = t²j (1 − R²)/(n − k) = Var(Û)/Var(Y) β̂²j = Var(Xj|X−j)/Var(Y) β̂²j = (1 − R²XjX−j) Var(Xj)/Var(Y) β̂²j.

We have, then:

√(t²j (1 − R²)/(n − k)) = √Var(Xj|X−j) |β̂j| / √Var(Y).

That is: the square root of the semi partial R² for Xj is, in units of the standard deviation of the data on Y, the change in the conditional expectation of Y given by a "reasonable" change in Xj, reasonable when the other columns X−j are kept constant, hence measured with the CONDITIONAL standard deviation of Xj GIVEN X−j.50

50 While Var(Xj|X−j) β̂²j / Var(Y) is a "direct definition" of the semi partial R² for Xj, as this is the difference between the R² with the full X and the R² with X−j, the identity Var(Xj|X−j) β̂²j / Var(Y) = t²j (1 − R²)/(n − k) rests on a lemma we did not prove. A simple proof of this lemma (what follows is not for the exam!) is as follows and, as it should now be not surprising, is based on the partial regression theorem. Since β̂j can be estimated by regressing Y on the residuals of the regression of Xj on X−j, we have β̂j = Σi yi (xij − Xi,−j γ̂XjX−j) / Σi (xij − Xi,−j γ̂XjX−j)², and this can be written as β̂j = Σi yi (xij − Xi,−j γ̂XjX−j) / (n Var(Xj|X−j)). The sampling variance of this is V(β̂j) = σ²ε n Var(Xj|X−j)/(n Var(Xj|X−j))² = σ²ε/(n Var(Xj|X−j)).
The estimate of this is V̂(β̂j) = σ̂²ε/(n Var(Xj|X−j)) = (n Var(ε̂)/(n − k))/(n Var(Xj|X−j)). Now recall that Var(ε̂) = Var(Y)(1 − R²), so we have that

t²j (1 − R²)/(n − k) = β̂²j (1 − R²)(n − k) n Var(Xj|X−j) / (Var(Y)(1 − R²)(n − k) n) = β̂²j Var(Xj|X−j)/Var(Y)

and we have the proof. All this depends on the fact that V(β̂j) is estimated with the usual OLS formula. This requires the hypothesis of uncorrelated residuals with constant variance. In case residuals are correlated and, in particular, correlated within groups of data, a frequently used estimate for the variance of the estimate of βj is the "clustered estimate". This tends to be bigger than the OLS based one and, by consequence, t-ratios tend to be smaller. A reasonable rule of thumb (see, e.g., Brent R. Moulton (1986), "Random group effects and the precision of regression estimates", Journal of Econometrics) gives an increase of the estimated variance which should be smaller than n/q, where q is the number of "clusters" in the data (hence n/q is the average size of groups). This implies that the "clustered" t-ratio should be approximately equal to the standard OLS one divided by √(n/q). If we call tc such "clustered" t-ratio, the above formula for the semi partial R² should be amended to t²cj (n/q)(1 − R²)/(n − k). However, in this case, a direct evaluation of the semi partial R² is always possible using two OLS regressions. The approximation is useful if, improperly, information on the semi partial R² is not offered by the Authors.

Moreover, since Var(Xj|X−j) = (1 − R²XjX−j) Var(Xj), we can write:

t²j (1 − R²)/(n − k) = (1 − R²XjX−j) Var(Xj)/Var(Y) β̂²j

or, equivalently,

β̂²j = t²j (1 − R²)/(n − k) × Var(Y)/((1 − R²XjX−j) Var(Xj)).

Notice that (1 − R²XjX−j) is frequently used as a measure of how correlations in the matrix X affect each Xj and is printed in regression output (usually as an option) under the name "Tol", for "tolerance".
This is because, if the Tol for some Xj is 0 or near 0, implying Xj linearly dependent, or almost linearly dependent, on X−j, we have that X′X cannot, or almost cannot, be inverted due to collinearity. The idea of a first assessment of the "relevance" of a variable in a regression by computing "how much the conditional expected value of Y changes if we change Xj by one unit of standard deviation", and then comparing this with the empirical standard deviation of Y, is, as we just said, common even if in general unjustified, as it implies a causal interpretation of the regression. From the above computations we see that there is a reasonable and correct version of this argument, in the forecasting setting, in terms of the semi partial R². We can summarize this in several, equivalent, ways:

1. The square root of the semi partial R² for a given column Xj is the ratio between the fraction of the standard deviation of Xj that is not correlated with the other columns of X and the standard deviation of Y, times the absolute value of β̂j.

2. The absolute value of β̂j is the ratio between the fraction of the standard deviation of Y "explained" by Xj alone (the square root of the semi partial R² times the standard deviation of Y) and the amount of standard deviation of Xj that is uncorrelated with the other columns of X.

3. β̂j in absolute value is the ratio between a "reasonable" movement of Y (one sigma) and one "reasonable" movement of Xj conditional on the other columns of X (one CONDITIONAL sigma), multiplied by the square root of the semi partial R².

4. The square of β̂j is the ratio between what "is to be explained" (the variance of Y) and what is "left in Xj to explain Y" (the conditional variance of Xj), times the fraction of the variance of Y "explained" by Xj given the other columns of X (i.e. the semi partial R²).

9.12.10 Again: "statistically significant" VS relevant

We have offered some hints on how to interpret a regression when parameters are known.
In all standard applications, β is not known and must be estimated. This is the "second step" in interpreting results we mentioned above. What changes? Actually not much: the problem is only to quantify how much we really know about β, since we can only estimate it, and it is easy to do this by studying its sampling variability. Although this should be simple (and in fact it is simple), it may create some new interpretation problems, which we discussed above when we compared "statistical significance" with "relevance". It is an empirical observation that, while the pitfalls and interpretation errors we are going to summarize here are warned against in most good books of Statistics, this common advice seems to work in some empirical fields while it is almost of no consequence in others. It is usually the case that the effect of such warnings is bigger when empirical analysis has a real practical purpose and smaller when the main reason for empirical analysis is more "paper publishing" oriented. In empirical Economics and Finance the main question has to do with the concept of "statistical significance". In the standard statistical jargon, an estimate of a parameter is "statistically significant" if its estimated value, compared with its sampling standard deviation, makes it unlikely that in other samples the estimate may change sign. In the standard regression setting, the most frequently used statistical index is the T-ratio, and the "significance" of an estimated βj is usually measured in terms of the P-value of its T-ratio. Does a small P-value imply that a parameter is "relevant" in any sense, beyond the fact that it is estimated well enough so that its value, in other possible samples, should not change sign? As already discussed, the answer is "absolutely not". We already commented on this when considering the semi partial R².
There is an even more striking way to present the point: suppose the parameter is known and is different from zero (so that its P-value is 0: it cannot be more significant than this!); the actual relevance of the corresponding regressor could still be absolutely negligible if the semi partial R² is small. Here, by relevance, we intend the ability of the corresponding Xj to "explain" an amount of variance of Y which is big w.r.t. the total variance of Y, but similar statements are true if you measure "relevance" in a more complex causal setting. "Statistically significant" only means that the statistical quality (precision) of the estimate is such that the estimate should not change sign if we change the sample. In iid samples, if n is big, typically all parameters become "statistically significant". This is because the sampling standard deviation decreases like 1/√n, so that even a practically negligible βj can be estimated with enough precision to allow us to distinguish it from zero. In no way does this imply βj to be "relevant" in any practical sense. What happens here is that, with n big enough, we can reliably assess that an irrelevant effect is actually irrelevant. It is frequent to see published papers in major journals where linear models with tens of regressors and tens of thousands of observations result in statistically significant coefficients with an overall R² in the range of a few percentage points and semi partial R² of fractions of 1%. Whatever the notion of "relevance" (forecasting, always available, or causal, requiring many hypotheses), it is difficult to conceive of any practical setting where such results could be termed "relevant", if not because they give relevant support to statements of "irrelevance" of the corresponding effects. This would not be so important, if the same papers did not spend most of their length discussing the meanings and the practical relevance of the effects supposedly "found".
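The situation described above can be reproduced in a few lines (a simulated sketch with numbers of our own choosing, numpy only): with a hundred thousand observations, a slope that moves the conditional expectation of Y by a negligible amount is nonetheless estimated with a very large T-ratio.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)
y = 0.03 * x + rng.normal(size=n)            # practically negligible slope

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - 2)
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
t = beta[1] / se                             # classical T-ratio

R2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

assert abs(t) > 4        # "very" statistically significant...
assert R2 < 0.01         # ...yet less than 1% of Var(y) is "explained"
```

The large sample has done its job: it tells us, very precisely, that the effect is real and that it is irrelevant.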
This misunderstanding between "statistical significance" and "relevance" must be avoided. If models were used for practical purposes (say for forecasting or controlling variables) the misunderstanding would quickly disappear: an estimate can be as significant as I like but, if the R² is small, the quality of the forecast shall be awful all the same. When models are only used for academic purposes (to appear in published papers) the misunderstanding may continue unscathed, sometimes with hilarious consequences. Summary: first assess the relevance of the regression and of the parameters of interest in terms of explained variance, as if parameters were known and not estimated. Then look at the statistical stability of the results. An irrelevant parameter is still irrelevant if it is "significant", while a parameter which could be relevant can be put under discussion if its sampling variance is too big.

9.12.11 A golfing example

In the following example we try to determine how much of the average per-tournament gain of the most important competitors in the 2004 PGA events can be captured by a linear regression on some ability indexes and other possibly relevant variables. The dependent variable, AveWng, is the average gain. The columns of X are:

a constant;
Age = the age of the player;
AveDrv = the average drive length in yards;
DrvAcc = the percentage of drives to the fairway;
GrnReg = the percentage of times the player reaches the green in the "regular" number of strokes;
AvePutt = the average number of putts per hole (should be less than 2);
SavePct = the percentage of saved points;
Events = the number of events the player competed in.

We have some expectations for the possible two-way correlations between these variables and the average winnings but, since the regression estimate measures a kind of joint dependence, it is really possible that such expectations, while reasonable, do not apply to regression coefficients.
In particular it is reasonable to assume that expected average money is positively correlated with AveDrv, DrvAcc, GrnReg and SavePct, and negatively with AvePutt, while we do not have an a priori on Age and Events. By the way: in this example no direct causal interpretation is reasonable. Let us start with some descriptive statistics and a simple correlation matrix. From this correlation matrix we see that at least one of our expectations is apparently not true: correlation with driving accuracy is negative. However we also see, and this could be expected, that the correlation between AveDrv and DrvAcc is rather strong and negative (longer means riskier). We'll see that this has an interesting implication on the overall regression. Let us now run the regression. Do not jump to the coefficients! First read the overall R-square. It is .45, that is: 45% of the variance of Y is due to its regression on X. It is important to keep this in mind: anything we'll further say about coefficients, partial R-squares etc. lies within this percentage. More than 50% of the dependent variable variance is not "ruled" by its regression on X. Another way to read the same result is to compute a confidence interval for the (point) forecasts. As seen above this interval is given by

xf β̂OLS ± z(1−α/2) σε √(1 + xf (X′X)⁻¹ x′f).

It could be shown that, for n − k not too small, an X′X with determinant not too near to 0, and an xf not "too far" from the observed rows of X, this can be approximated by

xf β̂OLS ± z(1−α/2) σε.

Under the same hypotheses we can freely put σ̂ε in the place of σε and still use the Gaussian in place of the T distribution. With this approximation, the point forecast interval is the same for each xf and its width is 2 z(1−α/2) σ̂ε. If we allow for the plus/minus two sigma rule this, with our data, becomes 4 times 41432 or: forecast plus/minus 2 times 41432.
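A rough numerical illustration of why the approximation works (a simulated design standing in for the golf data set, numpy only): the leverage term xf (X′X)⁻¹ x′f has average value exactly k/n over the observed rows, so for a "typical" xf the square root factor √(1 + leverage) is close to 1.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 8                                # sample size and number of columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Hat matrix; its diagonal holds x_i (X'X)^{-1} x_i' for each observed row.
H = X @ np.linalg.inv(X.T @ X) @ X.T
lev = H.diagonal()

assert np.isclose(lev.mean(), k / n)         # average leverage is exactly k/n
assert np.sqrt(1 + lev.max()) < 1.2          # even the worst row widens the interval by < 20%
```

The mean-leverage identity follows from trace(H) = k; it is only for xf far from the bulk of the observed rows that the full formula, with the leverage term, is needed.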
If we stick to the Gaussian hypothesis (or believe a central limit theorem can be applied in our case) this interval should contain the true value of yf with a probability of more than 95%. If we go back to the descriptive statistics, we see that the standard deviation of AveWng is 54990. This means that, without regression, our forecast would be the same for each observation and equal to the average (46548), and the corresponding forecast interval would be 46548 plus/minus 2 times 54990. With the regression our forecast is xf β̂OLS, so it varies with the observations on xf, and this variability, "captured" by the regression, is "subtracted" from the marginal standard deviation, so that the point forecast interval shall be narrower: the point forecast plus/minus 2 times 41432. You should notice that, with an R² of about .45, the width of the forecast interval is reduced by only about one quarter. This is not surprising: the R² is in terms of variance while the interval is in terms of standard deviations. Variances (explained and unexplained by the regression) sum, standard deviations do not (the square root of a sum is not the sum of the square roots). For this reason the term "subtracted" above is put under quotes. We may then question the statistical precision of our estimates, in particular the statistical precision of our R² estimate. In the output we do not have a specific test for this but we have something which is largely equivalent: the F-test tables. The F-tables imply rejection of the null hypothesis that there is no regression effect, meaning: all parameters are jointly equal to zero with the possible exception of the intercept. Notice that, with few observations, even a sizable R² like the one we found could be entirely due to randomness. The F-test tells us that this does not seem to be the case. This is not a direct "evaluation" of the statistical precision of our R² estimate.
However, implicitly, since there exists a direct link between the value of the F-test and the R², it tells us that an estimate of R² like the one we found is very unlikely if there is no regression effect. From the point of view of forecasting, this is all. We may like the results or not, but this is what we find in the data and, if we just suppose some "stability" of the model (see the comments above), this is the precision of the forecast we can make. What follows can be seen as an "anatomy" of the forecast in terms of each column of X. This can be useful for forecasting use but, obviously, it is much more relevant if the setting is such that we are able to hold a causal interpretation of the regression. If we go to the last column of the regression output (we added this to the standard Excel output) we find the semi partial R-squares. We see that only three variables have a sizable marginal contribution to the R² as measured by their semi partial R²: GrnReg, AvePutt and Events. This means that these are the variables whose addition to X most improves the forecast. Can we go a little bit further and say that, barring the Events variable which we shall comment on further, an increase of GrnReg and a decrease of AvePutt are the aspects of the game that, if improved, would imply a greater and more reliable increase in AveWng? This is a causal interpretation: is it reasonable in our setting? We cannot exclude this; we can only say it is very unlikely to hold. Why? The data is the summary of a season. It describes a set of "ability" indicators for each player and some other variables. Let us concentrate on the abilities. Let us take, for instance, Age. This is a typical variable you cannot intervene on. This notwithstanding, the variable changes in time. The possible causal interpretation would then be: each year the conditional expected value of gains goes down by almost 600 dollars. Is this the "effect" of age?
Even if we do not consider that what we observe is a cross section of players and not a time series of results for a single player (where we might actually observe the action of Age), we must answer “beware”. If a causal interpretation were possible, the β of Age times a change of Age would be the expected change in AveWng if all the other variables are constant, that is: if only Age acts and the golfer’s abilities as expressed by the other variables do not change. Is it reasonable that the natural evolution of Age does not change the other abilities of any given player? Quite unlikely. In any case this point should be assessed by theory and empirical data in order for a causal interpretation to be possible (the methods for doing this are not covered in this course). Let us now consider a variable on which we can think we could “act”: AvePutt. We cannot arbitrarily set this to a lower number (compare this with “changing interest rates”) but we may conceive of increasing the time dedicated to putting green training. If this reduces the number of putts, even by just 1/100, we should improve (change the conditional expected value of) our winnings “on average” by almost 700 dollars (69000 times 0.01). Is this the “effect” we can expect? It depends. Golfing is an equilibrium game. What counts is the overall result, and trying to improve one part of the game may have bad results on other parts of the game. By training more on the green maybe we worsen (or maybe improve?) our game from other points of view: length, precision from distance, etc. Moreover: the model was estimated on a sample of players with a given “equilibrium mix” of abilities. Is it still going to be valid if we alter such characteristics? Again: we do not know this and, with no answer, any attempt to use the model in this sense would be unwarranted. Notice that here we hinted at three different problems, the same problems we hinted at a number of times above.
The first is that it can be difficult or impossible to act on an Xj and, at the opposite extreme, some Xj is bound to change by itself. The second is that it may be difficult to intervene on one Xj without altering other Xj-s and, if this happens, we should model this interaction to have an idea about the “effect” on the dependent variable. The third is that any action on one or more Xj could alter the conditional expectation itself, and we should model this alteration. All these problems have been discussed and are still being discussed by econometricians. In fact, as we mentioned, these problems are at the origin of Econometrics and are still its central problem: what makes Econometrics a sister, but not a twin sister, of Statistics. Following the general approach of this section we do not further develop the “causal” discussion and, for a moment, suppose that we can improve our AvePutt, decreasing it without altering other variables or the regression function itself. Is it reasonable to assume an improvement of 1/100 if we suppose this does not alter the other indicators? Since we see that AvePutt is correlated with the other variables, this cannot be more than an approximation. However, if this correlation is not too big, it may be that the reasonable values of AvePutt, conditional on the other variables being constant, have a standard deviation which is big enough to allow for “changes” of 1/100 in AvePutt. The marginal standard deviation of AvePutt is about .023. Notice that this is a standard deviation across players, so it does not directly concern our problem. 1/100 is less than one half of the marginal standard deviation of AvePutt. This means that it is quite easy to find different players with such a difference in this statistic. With a little bit of unwarranted logic, let us assume that this is true also for the single player, if we do not condition on his other statistics.
This is the crucial point: both for different players and for the single player we must recall that we are within a regression and that we are evaluating the possibility of changing AvePutt by 1/100 while the other variables do not change. This means that, as stated above, we must consider the conditional standard deviation, not the marginal standard deviation. Recall the formula

√(t_j² (1 − R²)/(n − k)) = |β̂_j| √Var(X_j | X_−j) / √Var(Y)

Using the data in the output we find that the standard deviation of AvePutt (our Xj) conditional on the other variables (X−j) is .021, obviously smaller than the unconditional one but still more than 2 times the hypothesized change of .01. This implies that, even conditionally on the other variables, different players (and maybe the same player) could still easily show such different values of AvePutt. For the above mentioned reasons, this does not justify, by itself, a causal interpretation. However, if such an interpretation were available, an expected effect of the size of 700 dollars (69000 times .01) or even more would not be unreasonable. On the other hand an improvement of, say, .04 in AvePutt would probably be unlikely both marginally and, what is more important for us, conditionally on given values of the other X−j. If this causal analysis is viable, then, we may expect that work on the putting green which does not alter the rest of “the game” could give a golfer a reasonable improvement of 700 dollars in AveWng (roughly 1.5% of AveWng). Let us now consider other aspects of the estimates. A possibly puzzling point is given by the signs of AveDrv and DrvAcc, which are both negative. The semi partial R2 of AveDrv is almost 0 while that of DrvAcc is a little more than 2%. In most practical contexts we could then avoid discussing the estimates of the parameters for these variables. As an exercise, however, let us try to use what we know about partial regressions to unravel the puzzle.
Begin by comparing the simple correlations with AveWng and the signs of the parameters estimated in the linear regression. Notice that the sign of the parameter for DrvAcc is the same as that of its correlation with the dependent variable, while the sign of the parameter for AveDrv is negative with a positive correlation. A negative simple correlation between AveWng and DrvAcc may not be surprising and we may try an explanation which, as always in these cases, implicitly gives some causal interpretation to the parameters. The possible interpretation is this: it could simply be that, in order to be precise with the drive, a player tends to be too cautious, and this may harm his overall result. There are many alternatives to this interpretation, each depending on some strand of causal reading of the parameters. The choice among these depends on further and more complex analysis and on more structured hypotheses about how the performance of the golfer is connected to each of the statistics. On the plus side, none of this is needed for a purely forecasting interpretation. Now let us consider AveDrv: the correlation of this variable with AveWng is positive and not small, while the regression coefficient estimate is negative and the semi partial R2 is virtually 0 (much smaller than the same statistic for DrvAcc, whose correlation with AveWng, in absolute value, was roughly 1/2 of that of AveDrv). To understand what is happening, let us consider the result of the regression of AveDrv on the other columns of the X matrix: 60% of the variance of AveDrv is captured by its regression on the other columns of X. More than one half of this (38% semi partial R2) has to do with its negative dependence on DrvAcc. Also GrnReg shows a sizable semi partial R2 (14%) and a positive regressive dependence. As we know, only what is left as residual of this regression is involved in the estimation of the AveDrv parameter in the original regression.
This is the part of the AveDrv variance which is not correlated with DrvAcc and GrnReg (and the other variables in the partial regression). We know that GrnReg is the single most important variable in the overall regression (in the sense that it shows the highest semi partial R2). Based on this we may attempt an interpretation (again: many are possible): the “equilibrium player” represented by the regression tends to have a higher AveWng if the percentage in GrnReg is higher. On the other hand, a higher percentage of GrnReg tends to imply a bigger AveDrv. For this reason, marginally, AveDrv is positively correlated with AveWng. However, the AveDrv in excess of what is correlated with GrnReg seems to be harmful to the overall game and, from this, the negative coefficient in the overall model. Now compute, as we did above, the conditional standard deviation of AveDrv. According to our formula this is equal to √0.0000812 × 54990/94.76 ≈ 5.23, to be compared with a marginal standard deviation of 8.27. If we hypothesize a change in this variable (conditional on the other columns of X) equivalent to that hypothesized above for AvePutt (less than 1/2 of its conditional standard deviation), that is, equal to 2.5, the overall expected “effect” would be a decrease of AveWng of roughly 200 dollars. You would need a very big change of twice the conditional standard deviation (about 10) to have a negative effect comparable with an AvePutt change of 1/2 conditional standard deviation. Again a matter of care: these evaluations are borderline causal! In the end, what would be the most proper use of such a regression? Suppose you want to bet on how much, on average, a randomly chosen player is going to win. You know the characteristics of the player, you are betting on the results. The estimated regression would be a nice starting point.
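The back-of-the-envelope computation just performed can be wrapped in a small helper (a sketch: the formula is the one recalled above, and the numbers are those reported in the text for AveDrv):

```python
import math

def conditional_sd(semi_partial_r2, sd_y, beta):
    """Standard deviation of X_j conditional on the other regressors,
    recovered from sqrt(semi-partial R^2) = |beta_j| * sd(X_j|X_-j) / sd(Y)."""
    return math.sqrt(semi_partial_r2) * sd_y / abs(beta)

# AveDrv: semi partial R^2 = 0.0000812, sd(AveWng) = 54990, |beta| = 94.76
sd_cond = conditional_sd(0.0000812, 54990.0, 94.76)
print(round(sd_cond, 2))  # 5.23, versus a marginal standard deviation of 8.27
```

The same helper applied to the AvePutt numbers reproduces the .021 found earlier; this is just the semi partial R2 formula solved for the conditional standard deviation.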
Now change players into stocks, winnings into returns, and use market returns, price to book value, size, and so on as indicators, as in the Fama and French model or in the style analysis model. In which stock would you invest? To which fund manager would you give your money? These are clearly relevant questions and the regression model would be fit for them even without any causal interpretation.

9.12.12 Big/small partial R2 and relevance

A last relevant consideration: as we have seen, in order for a variable to “explain” a big chunk of the dependent variable variance it is necessary (not sufficient) that this variable has some variance left when regressed on the other explanatory variables. We also said that this is a rather generic statement and that a more precise analysis should be carried out case by case. Now we must also stress that the analysis considered here takes as “given” the joint distribution of X and, by consequence, the conditional distribution of each column of X given the other columns. In this setting, an analysis of the “relevance” of a variable in a regression based on the semi partial R2 is well justified. This seems the most relevant case in an observational setting, which is the setting most common in applications in Economics and Finance. Suppose, on the other hand, that there is the possibility that, keeping constant the regression function E(Y|X) = Xβ, the joint distribution of Y and X may change (perhaps because it is “acted on” by some policy decision or simply because of new circumstances). In this case, in general, the overall R2 and each semi partial R2 would change. For simplicity, just consider the “univariate” model yi = α + βxi + εi under the hypothesis E(yi|xi) = α + βxi. In this case the R2 and the semi partial R2 of x are the same and equal R2 = β²V(x)/(β²V(x) + V(ε)). To fix ideas, suppose β = .5, V(x) = 1 and V(ε) = 10. In this case R2 = .25/10.25 ≈ 0.024.
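The arithmetic of this example (and of the V(x) = 100 variant discussed just below) is immediate to reproduce:

```python
def r2_univariate(beta, var_x, var_eps):
    """R^2 of y = alpha + beta*x + eps: explained over total variance."""
    explained = beta ** 2 * var_x
    return explained / (explained + var_eps)

print(round(r2_univariate(0.5, 1.0, 10.0), 3))    # 0.024: x hardly "matters"
print(round(r2_univariate(0.5, 100.0, 10.0), 3))  # 0.714: same beta, big R^2
```

Notice that β is identical in the two calls: only the joint distribution of x (its variance) changes, and the “relevance” of x as measured by R2 changes dramatically.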
Whatever the interpretation of the regression (forecast or causal), it seems that the role of x, while existing (we know that β is not 0), is not so relevant (at least in terms of “good fit”). But suppose that, with no change in the regression function, either the variance of x becomes higher or the variance of ε decreases, or both; that is: suppose the joint distribution of y and x changes for some reason51. For instance, suppose V(x) = 100. If all the rest remains unchanged, the new R2 shall be equal to 25/35 ≈ 0.71. In this case the contribution of x to the quality of the forecast of y becomes considerable. Is this reasonable, is it relevant? For instance: in an observational setting, can we suppose that the data we use for the forecast are so different w.r.t. those used for the estimation? In a causal setting: is such a big alteration of the behaviour of x possible (and, in a multivariate regression, is such a change possible CONDITIONAL on the other columns of X)? This, obviously, cannot be assessed in general and can only be evaluated on a case by case basis. The important point is, again, to fully understand how even a simple and standard method like linear regression can never be dealt with in a ritual/cookbook way. Only a full understanding of the method and of the circumstances of its specific application can (and does) yield useful results. Failing this, its use can only be understood as a kowtow to pseudo scientific ritualism or, worse, misleading rhetoric. Let us now consider a case in which a variable may have a “relevant effect” even if it does NOT explain a big chunk of the dependent variable variance. Suppose for instance that you have a dataset where the observations are the heights of a population of adult men and women. The sample is very unbalanced: it contains, say, 1000 men and 20 women. For this reason most of the observed variance in height shall be due to variance across men.
If we regress heights on a constant and a dummy which is equal to 1 if the subject is a woman, we shall find, in all likelihood, a statistically significant negative parameter for the dummy (something like −10 centimeters) but an almost zero R2. This does not mean that the difference in height between men and women is irrelevant (it is relevant), but that, due to the fact that most of the sample is made of men, this difference does not explain a big chunk of the variance of THIS sample: most of the variance in this sample is not due to sex, but to variance in height among males. Now, suppose you apply this result to a balanced sample where 50% of the subjects are women and 50% are men. In this new sample a much bigger share of the variance shall come from the different sex. In other words: we would be forecasting in a setting where the distribution of X is quite different w.r.t. that valid for the estimation sample. More in general: it may be that the role of a variable, in a forecast or, if reasonable, in causal terms, is “big” while its partial R2 evaluated in the estimation sample is small. If this happens, it is usually due to the fact that, conditional on the other explanatory variables (and maybe even unconditionally), this variable varies very little in the estimation sample and so does not determine a relevant part of the dependent variable variance. It may be that, for some reason, the observed sample is unbalanced with respect to the population. If, in a more balanced sample, the explanatory variable we are considering is expected to have higher variance, it may be that its contribution to explaining the variance of the dependent variable increases, so that it becomes interesting to study its behaviour.

51 It is the same as saying that the joint distribution of x and ε changes.
However, if this is not the case and the sample is representative of the population we are interested in, the “relevant” parameter shall be interesting only if we compare the (few) sample points where the explanatory variable presents very different values. A second, very simple example: suppose you are interested in the expected life of a sample of patients after a given medical treatment. A small subsample of patients was using a given drug, say A. A new drug, B, is given to all the subjects in the sample and you observe a huge variation in mean survival time across different subjects, say a standard deviation of 10 years over a mean of 5. You also observe that the subsample previously treated with A has the same standard deviation but a mean of 10. Since this subsample is small, the difference between the means shall contribute very little to the overall variance (the partial R2 shall be small); however, it would be very proper to suggest the use of A jointly with B. Notice that the more you increase the subpopulation which is using A, the more the explained variance due to use/non use of A shall increase. This, however, is true only up to the point where the fraction of the sample using A is 1/2. If, for instance, everybody were to use A, there would be no “variation” of life span due to use/non use of A, but there would still be the “effect” of A in the 5 years on average gained by its use. Notice that in this example the reasoning is based on our ability to change the proportion of the population using A. Suppose instead that A is not the use of a medicine but the fact that your eyes are one blue and one brown. In this case, observing the same results, we would have very little to suggest except the fact that B seems a very useful medicine for the few lucky (in this case) people with eyes of different colors. So: beware of unbalanced samples. In other settings it may be that we can purposely alter the behaviour of some x not just in terms of level but also of variance.
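The dependence of the “explained” variance on the treated fraction can be made explicit. For a binary indicator with prevalence p that shifts the mean by δ, the variance of the conditional mean is p(1 − p)δ²: it is maximal at p = 1/2 and vanishes at p = 0 or p = 1, while the “effect” δ stays the same throughout:

```python
def between_group_variance(p, delta):
    """Variance of E(Y|dummy) when a fraction p of the sample is 'treated'
    and treatment shifts the mean by delta."""
    return p * (1 - p) * delta ** 2

delta = 5.0  # the 5-year average gain of the drug example
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(between_group_variance(p, delta), 2))
```

This is exactly the text’s remark: pushing the fraction of users of A from 0 towards 1/2 increases the explained variance, pushing it all the way to 1 makes it vanish, with the 5-year effect unchanged.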
The number of possible cases is huge and this is not the place to go further into them. A last comment: in the example above we have a case where a result which is irrelevant in terms of R2 gives us the relevant suggestion that we could assign both drugs A and B to all patients and hence alter the distribution of X. This is a real possibility: we can give both drugs (at least if their combination is not harmful), and from this comes the relevance of the result. Suppose instead that the difference is in terms of some other characteristic, say the color of the eyes. In this case we cannot change the percentage of the population with such characteristics, or “give both colors” to each element of the population. In this case, while interesting, the result is in any case “irrelevant”. Since all estimates and statistics could be identical in both cases, this implies that “relevance” is not something that can be fully resolved only on the basis of Statistics: it requires accurate analysis of the specific problem. It is also easy to show examples where a big partial R2, while important in forecasting terms and maybe also in “causal” terms, is in practice not directly of any use (beyond forecasting). Suppose we select a population of women of different ages, according to the marginal distribution by age of women, and attribute to each individual the number of children she gave birth to in the last 5 years. It is clear that age shall be relevant (in terms of partial R2) in “explaining” the variance of the dependent variable. This is expected and cannot be of use, as we cannot change the age of the elements in the sample. However, it is going to be important to keep the variable in the regression if we wish to assess the separate effect of other, less obvious but potentially relevant variables on which we can act (as, for instance, the amounts of vitamins in the blood of different subjects), when these variables are correlated with age.

Points to remember in reading a regression.
To conclude this section, let us summarize the steps in reading regression results. Before beginning remember: it may not be necessary for you to discuss “effects” of variables. This is really relevant only if you intend to use the model for a “policy” action. If your purpose is data summary or forecasting, “effects” (in the policy sense) are not the relevant aspect of regression to be studied. On the other hand, if “effects” are of interest for you, regression by itself shall not be able to evaluate them and you’ll need accessory hypotheses in order to be able to assess them. Among the possible accessory hypotheses, an experimental setting may sometimes be useful (if possible). A “structural” approach is another possibility and this is, from the historical point of view, the approach chosen by Econometrics. Neither is covered in this introductory course. This said, here is the list:

1. Divide the analysis: A) known parameters; B) statistical estimate. B) is easy (if you know your Statistics); A) is the tricky part.

2. Understand that your model is a model of a conditional expectation E(Y|X) = Xβ.

3. Understand that Y is NOT E(Y|X) but Y = E(Y|X) + ε and Var(Y) = Var(E(Y|X)) + Var(ε).

4. Quantify the part of Var(Y) due to Var(E(Y|X)) and the part due to Var(ε), that is: compute R2.

5. You MUST do this because the purpose of a linear OLS model (with intercept) is that of maximizing R2.

6. Moreover, when you discuss the meaning of each βj you are “partitioning” R2.

7. Understand that the βj of a given Xj can be computed in two ways (partial regression theorem): A) from the overall regression; B) first regressing Xj on the other columns of X and then regressing Y on the residuals of this regression.

8. Hence βj only pertains to the “effect” on E(Y|X) of what in Xj can change conditional on the other X being constant, NOT to the effect of a generic change in Xj.

9.
Be careful about the meaning of “effect”: strictly speaking, all you can say is that, if you build forecasts for Y given X using E(Y|X), the said “effect” is that, if it happens that “Nature” gives you two new vectors of observations on X where the difference between the two vectors is just a different value of Xj alone, then the difference between your forecasts is given by the difference in the two values of Xj times the corresponding βj (or its estimate, if you are using an estimated conditional expectation). In other words: this tells you nothing, by itself, about the possible change in Y given an act on your part to change some value of Xj. The (by no means easy) study of such a “causal” interpretation has always been very much in the mind of econometricians, who evolved structural Econometrics as an attempt to answer the question (very interesting for obvious policy reasons) of assessing the possible results of a change in a variable not just given by “Nature” but acted by a policy maker. The obvious difference between the two cases is that the act could not respect the “Natural” joint distribution of observables, making your previous study of this useless as a source of answers. Just think of the obvious difference between observing, say, interest rate changes induced by market dynamics and imposing an interest rate change by policy: the laws concerning the effects of the second act could be completely different from the laws concerning the “natural” evolution of rates in the market. On the contrary, the “forecast change” is always a good interpretation if the observed change happens without interference.

10. Once you understand the meaning of the word “effect”, quantify it, in your sample (that is: for a given joint distribution of Y and X), with the semi partial R2 due to the j−th regressor, computed as t_j²(1 − R²)/(n − k) (if you are reading a paper and, bad sign, R2 is not available, use the same formula with R2 = 0.
This shall give you an overvaluation of the partial R2.)

11. Evaluate the practical significance of this “effect” (while doing this, ask yourself if the sample is balanced with respect to the explanatory variable and, if not, consider whether a balanced version of it is a sensible possibility: see above the examples involving the heights of males and females and the experiment with two medicines). In general this depends on the specific case. However, a first rough idea can be gained by computing the change in the conditional expectation of Y induced by a “reasonable” change of Xj. Since this must be a “reasonable” change “which leaves the other explanatory variables unchanged” (recall the meaning of βj induced by the partial regression theorem), it could be measured by the conditional standard deviation of Xj given the other regressors. It could be useful, then, to compute the ratio |βj|√Var(Xj|X−j)/√Var(Y), which expresses this “effect” in units of the standard deviation of Y (the modulus around βj comes from the fact that we get the formula by taking the square root of a square). A quick proof shows you that this is exactly identical to the square root of the semi partial R2 for Xj, and this confirms the centrality of this quantity. Beware! Do not be deceived by a quantity which bears some resemblance to this: |βj|√Var(Xj)/√Var(Y), which shall obviously be bigger (actually: not smaller) and so, maybe, more gratifying. The point is that, by not conditioning the variance of Xj, it violates the interpretation of the coefficient. By the way: it may well happen that this quantity is bigger than 1, which, obviously, is absurd52.

12. Then do Statistics (namely: consider that you must estimate β and evaluate the quality of the estimate).

13. Remember: an estimate is “statistically significant” if the ratio of its value to its sampling standard deviation is big enough to say that you can reliably distinguish it from zero.

52 Most of the time, this choice is made when the correct measure of relevance would give as a result a very small value, that is: the practical irrelevance of the “effect”. In this case the use of the unconditional standard deviation inflates the result but, since the starting point is very small, the inflated value is smaller than 1 and, apparently, you do not get absurd results. The inconsistency is in any case evident: for instance, you get a semi partial R2 of, say, .0001 for a given Xj and then you find written in the paper that “a change of Xj equal to its standard deviation implies a change of (the conditional expected value of — but sometimes this too is forgotten) Y equal to 1/2 of its standard deviation”. These two pieces of information are evidently conflicting, and the solution is that, since the “effect” measure given by βj only has to do with “a change in Xj with the rest of the regressors constant”, you cannot use the unconditional standard deviation of Xj as a measure of a “normal” change in Xj (it could be, but in an unconditional setting): you must use the conditional standard deviation. Just take the square root of the semi partial R2 and you get that the correct measure of the change in the conditional expectation of Y, given a “reasonable conditional on X−j” change of Xj (measured by its conditional standard deviation), is, in units of the standard deviation of Y, given by .01. A completely different picture. What is happening? Just take the ratio of this with the previous number: .01/(1/2) = .02; this is the ratio between the conditional and the unconditional standard deviation of Xj. It happens that today’s common use of very big samples, which can be used only by adding a huge number of “fixed effects”, makes this event quite common.

14.
Remember: a “statistically significant” estimate could well be practically irrelevant if it corresponds to a small semi partial R2. Moreover: with enough data you can get any precision you like out of an estimate and distinguish the estimate from zero even if its value is almost exactly zero. If you just play with the sample size you shall see that the semi partial R2 is very little affected by the sample size when this changes from good, to huge and maybe to amazing. However, you are free to use the above suggested (correct) statistical measure of practical relevance but... what is “small” and “big” in terms of practical relevance depends on the specific purpose of the analysis, not on Statistics. In a good empirical analysis the researcher should pre-specify which size of an effect is practically relevant and which precision is required for its estimate. This would allow the researcher to choose (when practically possible) a sample size big enough to give estimates of the parameters precise enough to assess the size of the effect to the required precision53.

15. In any case remember the difference between significance and relevance. E.g.: beware the use of large datasets. If, in the comments to their results, Authors using large datasets stick too much to “statistical significance” and do not deal with practical relevance, most likely it is the case that their results can be summarized as “a very precise estimate of irrelevant effects”, so that the reading of the main results of the paper can usually be changed into something like: “our data point strongly to the irrelevance of the effect under study”. By the way: while not currently fashionable, such a finding could be of great interest.

16. Finally: beware of unbalanced samples (this is the same as 11 but it is very important, so I repeat it).

9.12.13 Envoy: an example of important points we left out of our analysis. More on coefficient interpretation: are “changes in X” reasonable?
We have seen that, from the statistical point of view, it may be difficult to “change Xj independently of the other X-s” because the columns of X could be strongly correlated, so that, once we “keep the other X-s constant”, Xj may change only by a very limited amount. Let us see this from another point of view. Each observation on the basis of which we estimated the linear model had a given vector of values for the various Xj. If we change any of these values by any amount, in all probability we shall define a vector which does not correspond to ANY observed data point. Is this new vector a possible combination of values for X? The question is by no means irrelevant. Suppose you are studying, say, the relationship between the price per kilogram of several cakes and the percentages of different ingredients. You may find that a change of, say, 1 percentage point in the percentage of a given ingredient, say flour, “keeping the others constant” (and here it is a good exercise to understand the meaning of this since, as the sum of all percentages is 1, it is difficult to change one and keep the others constant) increases the price per kilogram by, say, one dollar.

53 In Finance we have a very good example of this. Factor models break down the overall variance of a return into components correlated with “factors”. The most relevant one is the “market level” but, over time, this has been supplemented by other factors like “value”, “size”, “momentum”, etc. These new factors “explain” a very tiny fraction of the variance of returns, the more so if compared to the market factor. However, their study is not irrelevant, because you could in practice invest in a portfolio whose only (systematic) change in returns is correlated with just the chosen “factor”. In other words: “even a small variance is relevant if you can isolate it from the overall variance”.
Now, the question is: if I increase the percentage of flour by 1 percent I increase the price by one dollar, and maybe by 2 for 2 percent, but... wait a moment... any cook knows that such a change is not going to give you a more costly cake, but no cake at all! Maybe you can do something similar for very small changes in the ingredients; however, in this case the result is not going to be so interesting. The question is: does the new combination of values for “X” I am hypothesizing still correspond to a cake? In a sense each recipe is an equilibrium point: the right combination of ingredients (and not only this) yields the good cake. Am I sure that, by modifying one or more ingredients, I still get a cake and not a mess? Let us take a small step further and be more like Economists. We are interested in production functions and we are regressing, say, the log of the output (in some unit, which?) of different production plants onto the logs of different inputs. Again, supposing we understand the statistical meaning of the estimated coefficients, if we apply such coefficients to a different set of inputs the model gives us a “forecasted” value of output; however, before discussing this value: is the combination we are considering a viable combination, an equilibrium combination? We know a priori, by observation, that those combinations of input values which are in X are viable: they PRODUCED an output! But is this still true of the new one? A last step and then we conclude. In many macro models, data are for several variables in different countries and, sometimes, in different periods of time. Each observed data point corresponds to some “equilibrium” state as, after all, the data WAS observed. Is this true for ANY combination of data we may be interested in using for getting forecasts, or should we, a priori, be able to show that such a combination IS viable? In recent years it has become popular to regress some economic variable, e.g.
the growth rate of GNP or the change in size of the debt or similar, on a cross section of countries with data describing some sociopolitical characteristics of such countries (coming as a rule from large scale surveys). Forget for a moment the fact that such datasets could be heterogeneous and could violate basic OLS hypotheses. Suppose you get an estimate of the OLS coefficients. How can you read such estimates, what is the use of the results? Do you believe that you can change one or more of the characteristics of a country and still get a new viable country which satisfies the regression equation and for which the forecasted value of the dependent variable is a possible output? Provided that you really may change such characteristics, the likely result is that you destroy an equilibrium and, if a new equilibrium is reached, it shall be completely different from the old one. So beware: a linear regression model is useful to describe the “output” corresponding to different sets of “inputs”, but any reading of its results in terms of “change of Y due to change of X” is strongly dependent on the hypothesis that any “test” combination of X-s you are interested in is a “viable” or “equilibrium” combination. It should be clear enough that to assess the validity of such a proposition is by far more difficult than to simply estimate the model.

9.12.14 Further readings (Absolutely Not Required for the exam)

Probably the best book on linear regression, from the point of view of interpreting results, with strong, detailed statements about the difference between forecasting and causal analysis, lots of examples and insight, and with a minimum of Mathematics (probably because it was written by very good mathematicians), is: Mosteller, F. and Tukey, J. W. (1977). “Data Analysis and Regression: A Second Course in Statistics”. Addison-Wesley, Reading, MA. In particular, see ch. 13 with the meaningful title: “Woes of Regression Coefficients”.
A good and more concise summary can be found in: Sanford Weisberg (2014) “Applied Linear Regression”, III ed., Wiley. In particular, see ch. 4. A short paper by a great statistician which contains, in simple and condensed form, most of what was discussed here, is: George E. P. Box (1966) “Use and Abuse of Regression”, Technometrics, Vol. 8, No. 4 (Nov., 1966), pp. 625-629. For the maths of the semi partial R2, together with a keen discussion of “effect sizes”, you may see: Jacob Cohen et al. (2013) “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences”, III ed., Routledge. To those interested in reading something more about the different interpretations of a linear model (e.g. forecast VS causal), which make for, arguably, a very tricky and slippery field to walk on, the following books could be useful: J. D. Angrist and J. S. Pischke (2009) “Mostly Harmless Econometrics”, Princeton University Press. J. Pearl et al. (2016) “Causal Inference in Statistics: a Primer”, Wiley.

9.12.15 Examples from the literature: interpreting VS advertising a regression

We report here three examples from published papers in main journals which could be useful as examples of both correct and wrong readings of simple regressions. These are not Finance papers, as I had to limit myself to standard OLS examples and these are not so frequent in recent literature (regression models are used a lot but, due to the specific settings, the estimation method is not simple OLS, so that the above results are no longer STRICTLY valid). There is an obvious sample selection bias in this choice: I chose papers where the readings of regression results, while rather standard, left many points to be discussed. The comments below are, and are to be intended as, limited to the reading of specific regression results and do not extend to the full papers.
It is frequently the case that a paper contains very interesting ideas even when such ideas are, in the specific instance, really not upheld by the empirical analysis presented in the paper itself. It is not a joke to say that, at least in Economics and Finance, it can be very difficult to find good empirical grounds for assumptions so reasonable as to be “necessarily” true. Sometimes this would induce the best researcher into doubting the data and not the hypothesis. In practice, sometimes this induces even the best researcher into a “creative” use of Statistics so as to make “statistically relevant” what SHOULD, a priori, be relevant.

The first example is drawn from “Distributive Politics and Economic Growth”, Alberto Alesina and Dani Rodrik, The Quarterly Journal of Economics, Vol. 109, No. 2 (May, 1994), pp. 465-490. The regressions we consider have as dependent variable the average per capita growth between 1960 and 1985. The purpose of the models is to measure the dependence of growth on the initial value of the Gini index. The explanatory variables are: per capita GDP, primary school enrollment ratio, the Gini index for the concentration of the income distribution and the Gini index for the concentration of the land ownership distribution. A dummy (0-1) variable for democracy is included and the product of this times the Gini/land coefficient is considered. Regressions are run on several subsections of a sample but we do not comment on this; moreover, we only consider OLS estimates. Quoting from the paper: “The results indicate that income inequality is negatively correlated with subsequent growth. When either one of the two Gini’s is entered on its own, the relevant coefficient is almost uniformly statistically significant at the 5 percent level or better and has the expected (negative) sign. The only exception is the OLS regression for the large sample (column (3)), where the income Gini is statistically significant only at the 10 percent level.
We also note that the t-statistics for the land Gini are remarkably high (above 4), as are the R2’s for the regressions that include the land Gini’s. When the land and income Gini’s are entered together, the former remains significant at the 1 percent level, while the latter is significant only at the 10 percent level (the sample size shrinks to 41 countries in this case, since many countries have only one of the two indicators). The estimated coefficients imply that an increase in, say, the land Gini coefficient by one standard deviation (an increase of 0.16 in the Gini index) would lead to a reduction in growth of 0.8 percentage points per year.”

Let us comment on this. “Income inequality is negatively correlated with subsequent growth”: we (as the Authors) see that in the regressions with either the income Gini or the land Gini coefficient, but not both, the T-ratio of the included variable estimate is “statistically significant” with a P-value smaller than .05. However, when both variables are included together ((7) and (8)) only the land Gini coefficient is significant. Using the partial regression theorem, we can say that the two Gini coefficient series are correlated, but the coefficient which has an “effect” (in forecasting terms) on growth is NOT the Gini coefficient of income, but the Gini coefficient of the land distribution. If we drop the land distribution Gini, the income distribution Gini becomes relevant as a proxy (due to the correlation) for the land distribution Gini. So, variation of the income distribution, when not correlated with variation in the land distribution, has no relevant correlation with growth. Hence the above quoted paragraph should begin with: “The results indicate that land distribution inequality, and not income inequality per se, is negatively correlated with subsequent growth” or, better, “The results indicate that land distribution inequality is negatively correlated with subsequent growth.
Income inequality is not related with subsequent economic growth EXCEPT in that part which is explained by unequal land distribution” (for this reason: when both variables are in the model, the land distribution Gini prevails). In fact, this could add interest to the Authors’ conclusions.

Now about the “size” of the effect (see the general discussion above). “The estimated coefficients imply that an increase in, say, the land Gini coefficient by one standard deviation (an increase of 0.16 in the Gini index) would lead to a reduction in growth of 0.8 percentage points per year.” This is a (rather frequent) incorrect reading of the meaning of a regression coefficient. We already commented about this point. It is incorrect for several reasons. The regression has to do with conditional expected values of the dependent variable, not with observed values. The two would be similar if the regression R2 were near one, which is not the case here. In general the “full” change in Y is the change in the conditional expectation PLUS a random error (which, in this example, and if we suppose we are in model (5), has about the same variance as the conditional expectation, since the R2 is of the order of .53). In short: we may speak of a “reduction of the conditional expectation of growth”, not of a “reduction in growth”. Each coefficient has only to do with the “effect”, on the conditional expectation of the dependent variable, of a change of, in this case, the Gini land coefficient “keeping all the other variables constant” (again: this has only to do with a change of forecast, not with any causal interpretation. We have glossed so much on this word that we should be able to use it here as a shorthand while avoiding wrong ideas). If I want to express the effect w.r.t. nσ changes in Xj I must consider the conditional σ of Xj GIVEN the other columns of the X matrix.
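The proxy mechanism described above is easy to reproduce in a small simulation (hypothetical data, with the variables loosely named after the two Gini indexes): only “land” enters the conditional expectation of y, but “income” is correlated with “land”, so “income” looks like it has an effect when entered alone, and its coefficient collapses when both are entered together.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Only "land" truly enters the conditional mean of y,
# but "income" is correlated with "land" (made-up coefficients).
land = rng.normal(size=n)
income = 0.8 * land + 0.6 * rng.normal(size=n)
y = -1.0 * land + rng.normal(size=n)

def ols(cols, y):
    # OLS with an intercept; returns the coefficient vector.
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_alone = ols([income], y)[1]       # income alone: picks up the
print(b_alone)                      # land effect as a proxy (negative)

b_both = ols([income, land], y)     # together: income ~ 0, land ~ -1
print(b_both[1], b_both[2])
```

This is exactly the partial regression theorem at work: once land is in the model, income is left only with its variation orthogonal to land, which here carries no signal.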
In the above example: we do not know (data are not provided) the correlation of the Gini land index with the other independent variables, but it is quite clear that this is not zero. If we suppose the estimate to come from model (5), the estimated parameter value is -5.50 and the T-ratio -5.24. Computing the semi partial R2 with n = 46, k = 4 and an overall R2 = .53, we have that the amount of the overall R2 due to the Gini land variable is 5.24^2 (1 - .53)/42 = .3. The square root of .3 is about .55. According to what we know about the semi partial R2, this means that a change of one conditional sigma of the Gini land index times the estimated parameter, divided by the standard deviation of the dependent variable, is equal to .55. In order to get results in terms of values of the dependent variable we would need data on the standard deviation of the dependent variable, which are not available.

A final question: even if we accept the interpretation of the paper, does this mean that if we act by reducing land distribution inequality we should get an increase of growth? Notice that the paper does not explicitly say that by “increase” is intended some change due to policy actions, revolutions, or any act which alters the “equilibrium” expressed by the dataset. In the conclusions of the paper, however, this seems to be the idea of the Authors. In fact their idea seems to be that what is observed is not an equilibrium at all: “The basic message of our model is that there will be a strong demand for redistribution in societies where a large section of the population does not have access to the productive resources of the economy. Such conflict over distribution will generally harm growth. Our empirical results are supportive of these hypotheses: they indicate that inequality in income and land distribution is negatively associated with subsequent growth”.
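The arithmetic of the semi partial R2 computation can be packaged in a few lines (a sketch: the function below simply restates the computation used in the text, with the same degrees of freedom, n - k = 42, as the text's arithmetic):

```python
import math

def semipartial_r2(t, r2, n, k):
    # Contribution of one regressor to the overall R2, recovered
    # from its T-ratio; degrees of freedom n - k as in the text.
    return t ** 2 * (1.0 - r2) / (n - k)

# Alesina-Rodrik, model (5): T-ratio -5.24, R2 = .53, n = 46, k = 4.
sp = semipartial_r2(-5.24, 0.53, 46, 4)
print(round(sp, 2))             # 0.31, i.e. ~ .3 as in the text
print(round(math.sqrt(sp), 2))  # 0.55
```

The square root, .55, is the effect of a one-conditional-sigma change in the Gini land index measured in standard deviations of the dependent variable.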
Whatever be the connection of such a statement with the empirical analysis presented in the paper, it may be useful to recall that the paper presents an observational study. Hence, the reading of the results made in the paper is compatible, under some stationarity hypothesis, with a forecasting use: we are measuring how much forecasts of growth differ for countries with different Gini coefficients of, say, land, and identical values of the other variables. Doing this we just observe; we, politicians or revolutionary leaders, do not act by forcing variables to specified values and we do not know the result of such actions. The study of the implications of an action toward the reduction of inequality, whatever the origin of such action, would require a “causal” approach which is not developed in the paper. The causal approach, given the impossibility of experiments, should be based on a structural model to specify in detail which shape the policy toward changing income concentration (or, better, land concentration) would take and, for instance, clarify under which conditions such a policy would keep unaltered both “the other variables” and the overall structure of the conditional expectation. Alternatively, the structural model should specify in which ways the policy action would change these. Any causal interpretation is simply without grounds if these conditions are not satisfied.

We pass now to a second paper: “Fairness and Redistribution”, Alberto Alesina and George-Marios Angeletos, The American Economic Review, Vol. 95, No. 4 (Sep., 2005), pp. 960-980. Here we see an example of a rather anomalous model. The dependent variable is bounded; this by itself may create problems for the validity of the OLS hypotheses54, but we do not comment on this. Instead, we choose this example because it is a clear instance of the “significance thru sample size” + “statistically significant means relevant” pitfall we mentioned above. The idea is as follows.
Under OLS hypotheses, the variance of the estimates is, roughly, decreasing linearly in n, the number of observations. This means that even very small parameters, of no practical consequence, can be estimated with such a precision as to be distinguishable from 0. However, “statistically significant” actually means only this, roughly: “the parameter estimate has a sampling variance small enough that we are able to say the parameter is not exactly zero”. In other words, very small parameters can be distinguished from 0, if n is big: they are “statistically significant”, their T-ratios are big enough to reject a null hypothesis of zero for any sensible size of error of the first kind. As explained above, this is not sufficient to say that such parameters have a size which is relevant in an economic, or any other, sense. Readers of standard books of Statistics are frequently warned about this point, but the confusion between “statistical significance” and “relevance” is still there to be observed in applied research across very different fields (for this exact reason books on basic Statistics, and these handouts, still warn you about this problem). In a typical case with “big n”, but not very relevant parameters, we have regressions where most parameters are “statistically significant” while the overall R2, and/or the semi partial R2 of the parameters of interest, is very small. In cases like this, the correct overall interpretation is: “OK: I have a statistically very stable estimate of a very small parameter. This implies the irrelevance of the corresponding variable in the regression, at least in the sense that it contributes very little to the variance of the dependent variable55. In this sense I can say that the overall effect of the variable (in forecast terms, without further causal analysis) is well estimated to be negligible”.
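A quick simulated illustration of the “significance thru sample size” pitfall (made-up data): with a hundred thousand observations, a true coefficient of 0.02 on a unit-variance regressor is comfortably “statistically significant”, while its contribution to the variance of the dependent variable is of the order of 0.04%.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# A tiny true coefficient, estimated on a very large sample.
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Classical OLS standard error of the slope, T-ratio and R2.
s2 = resid @ resid / (n - 2)
se = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())
t = beta[1] / se
r2 = 1 - resid.var() / y.var()

print(t)   # "statistically significant": |T| well above 2
print(r2)  # yet the regressor explains a tiny fraction of Var(y)
```

The estimate is statistically very stable; what it stably estimates is a negligible effect.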
In other words, a simple look at the R2 (or at the semi partial R2 value), joint to the fact that the number of observations is big, so that, if the OLS hypotheses are valid, the estimates of the parameters (and of the R2) are statistically very stable, should prevent any further analysis of the model except an analysis directed to establish why variables that, a priori, the researcher considered relevant seem to have, in the data, negligible effect. In fact, the relevant (and it IS relevant) information we can derive from the model is that the effect of any of the explanatory variables on the dependent variable is substantially zero. Notice that this could also imply a problem in the design of the empirical analysis and is quite useful information. Notwithstanding the usefulness of such results, sometimes (and in some fields) both researchers and journals consider such results unsatisfactory.

54 Both in the weak and strong version. Since the dependent variable is between 0 and 1, the forecast plus the error should be bounded between 0 and 1. This implies a (non linear) dependence between forecast and error, as a forecast near, say, one is compatible only with a small positive and a possibly bigger negative error. Moreover errors cannot be Gaussian, as the support of this probability distribution is unbounded. More specific models (Logit, Probit, etc.) exist for this kind of data; however, the use of a linear regression, named “linear probability model”, can still be a first useful approximation.

55 We discussed the fact that it may sometimes be the case that some variable which contributes little to the R2 is “relevant” in some other sense. Moreover, sometimes to discover that some variable does not contribute to the R2 could be quite interesting.
This may be the reason why, for instance in this particular case, we read: “As Table 2 shows, we find the belief that luck determines income has a strong and significant effect on the probability of being leftist.” Notice the term “effect”. As in the case of the previous paper, this could be implying, or not, a causal analysis, which is not developed in the paper. The doubts on the effective purpose of the paper in this respect are the same as those raised before, so we do not discuss these any more. Here, we shall intend the term as “ability to forecast” and not as anything to do with the possibility of intervention. Now to the “strong and significant” clause. This is what the Authors write. Let us see how this idea, which is in contrast with the reported results, could arise (the contrast is not with the word “significant”, if this means “statistically significant”, but with the word “strong” if it is taken with any sensible meaning implying any relevant “explanatory” power of the regressors). The dependent variable is a 0 to 1 index related to the answer to the question: “In political matters, people talk of left and right. How would you place your views on this scale, generally speaking?”. The variable of interest for the Authors is the “individual belief that Luck determines income” (which, I think, is again a 0 to 1 variable). The corresponding estimated coefficient, depending on the model, is .54 or .607 (but you should consider the second, why?) and the corresponding T-ratio is 3.88 (at least, we suppose that this is the T-ratio as stated by the Authors; in fact the T-ratio should have the same sign as the parameter estimate, while the reported ratios are all positive). On the basis of this info the statement of a “strong and significant effect” of the belief about luck is unwarranted. In fact the Authors include both the model with and the model without the variable of interest, and the overall R2 changes by only .01.
This would be a direct estimate of the semi partial R2; we must take into account, however, that the samples are not the same for the three regressions. Using our Corollary with n = 14998 and k = 16, we have that the contribution of the “luck” variable to the overall R2 is almost exactly .001. The square root of this is roughly 3/100, so that the expected change in the conditional expected value of the dependent variable due to a change (“keeping the rest constant”) of one conditional standard deviation of the explanatory variable is of the order of 3% of the standard deviation of the dependent variable. While statistically significant (a T-ratio above the 5% level), the effect is in fact negligible or, in better words, it is well estimated to be of a negligible size. Suppose that the “luck” variable is itself between 0 and 1, consider the extreme values, and even suppose there is zero correlation between the “luck” variable and the other variables. The difference of the conditional expectation in the extreme cases is .607, which on a 0-1 scale may seem big. However, what you observe is not the expected value of the dependent variable but this plus the error, and only 1/1000 of the variance of the expected value plus error is due to the (extreme) change in the explanatory variable, hence to a change in the expected value. Sure, it is easier to win betting on heads with a coin where the probability of heads is .501 rather than .500; however, I would not say that I have a “significant and strongly higher probability” to win if I bet on heads (even if such a very small difference in probability can be very well estimated, and so be statistically significant, if the number of observations is very big). What could be said is that n is so big that even small differences between the probabilities of heads and tails can be estimated with statistical reliability even when very small: this is the only meaning of “statistical significance”56.
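To fix ideas on the coin example, we can compute how big n must be before a head probability of .501 becomes “statistically significant” (a sketch with a conventional, and here arbitrary, choice of 5% two-sided size and 80% power):

```python
# Sample size needed to distinguish p = .501 from p = .500
# with a two-sided 5% test and 80% power (standard normal
# approximation; the error rates are a conventional choice).
p0, p1 = 0.500, 0.501
z_alpha, z_beta = 1.96, 0.84   # standard normal quantiles
sd = 0.5                       # std dev of one toss, ~ sqrt(p(1-p))

n = ((z_alpha + z_beta) * sd / (p1 - p0)) ** 2
print(round(n))  # 1960000: about two million tosses
```

With two million tosses the difference is “statistically significant”; it is still a .501 coin, and betting on heads is still, for any practical purpose, a fair bet.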
Our last example is drawn from “Does Culture Affect Economic Outcomes?”, Luigi Guiso, Paola Sapienza, Luigi Zingales, The Journal of Economic Perspectives, Vol. 20, No. 2 (Spring, 2006), pp. 23-48. The dependent variable here is a 0, 1 (not 0-1: only 0 and 1) variable where 1 means that the respondent is self employed and 0 that the respondent is employed but not self employed. As in the previous case there are problems in using linear regression here, but we do not discuss this. The explanatory variable is “Trust” and it is a dummy variable equal to 1 if there is a positive answer to a question related to “trust in others”. Other variables are added as “controls”. Again, while the purpose of the paper is not clear, here by “effect” we do not intend the effect of an action but just a measure related to forecasting. Probably, the Authors have in mind a causal effect. In fact the Authors use a method (instrumental variables), which we do not discuss in this introductory course, and which tries to estimate the regression coefficient of a variable in a regression with many variables, some of which are not observed. A common mistake is to consider this as equivalent to the measure of a “causal” effect. This is wrong, but both the introductory level of this course and the information contained in the paper do not allow us to go further on this topic. The estimate to be considered is the one corresponding to the second model (as usual, if estimates change when adding variables it is better to consider only the model with the greater number of variables).

56 At this point in the Handouts the Reader should have understood a leitmotiv of our presentation: a good user of Statistics, before even observing data, should have a clear idea about which “size” of “effects” can be distinguished on the basis of the available data and about the implied ability of the data to yield relevant information.
The value is .0167 and the standard deviation .0046 (this is an estimate derived with a formula somewhat different from the OLS one, but this does not change our interpretation of the result). We do not have the overall R2; however, we can use our corollary in order to estimate the amount of R2 due to the Trust variable. With n = 22791 and k = 17 we have that this amount is, at most, 0.0005. The Authors’ comment is: “As Table 1 reports, trust has a positive and statistically significant impact on the probability of becoming an entrepreneur in an ordinary least squares regression (the probit results are very similar). Trusting others increases the probability of being self-employed by 1.3 percentage points (14 percent of the sample mean)”. This sentence (again: in a forecast sense) IS (partially) correct, as the impact IS positive and statistically significant. Moreover, using the term “probability” the Authors are clearly considering the expected value of the dependent variable (they should use the term “conditional probability” but this would perhaps be too pedantic). If we recall that the square root of the semi partial R2 is equal to:

√Var(Xj | X−j) |β̂j| / √Var(Y)

and compute this, we get a value of the order of 2.2%. Since the dependent variable’s empirical variance is 0.092(1 − 0.092) (1.3 is 14% of the sample mean and the sample mean is the relative frequency of ones), a change of the square root of this of the order of 2.2% is equivalent to a change of the frequency of self employed of less than 0.005 (a bit improperly, you may think of this as the change of roughly 110 units in the sample of 22791 units from not self employed to self employed). The Authors are not claiming that the effect of Trust is of any practical relevance; however, they do not point out, on the contrary, that the effect is arguably WITHOUT ANY practical consequence.
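The computations for the Trust variable can be checked in a few lines (a sketch: the bound uses 1 − R2 ≤ 1, since the overall R2 is not reported, and the small differences w.r.t. the rounded figures in the text come from rounding at intermediate steps):

```python
import math

# Guiso-Sapienza-Zingales Trust regression:
# estimate .0167, standard error .0046, n = 22791, k = 17.
beta, se = 0.0167, 0.0046
n, k = 22_791, 17
t = beta / se

# Upper bound on the Trust contribution to the overall R2
# (taking 1 - R2 <= 1 in the corollary).
sp_max = t ** 2 / (n - k)
print(round(sp_max, 4))             # 0.0006 (the text rounds to .0005)
print(round(math.sqrt(sp_max), 3))  # 0.024, of the order of 2%

# Dependent variable: a 0/1 indicator with mean .092, so its
# standard deviation is sqrt(.092 * (1 - .092)).
sd_y = math.sqrt(0.092 * (1 - 0.092))
print(math.sqrt(sp_max) * sd_y)     # well under 0.01: negligible
```

However rounded, the conclusion is the same: a statistically significant but practically negligible contribution.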
It is clear that, in this example, the sample is so big that, if we suppose standard iid hypotheses are valid, very small “effects” can be measured with precision, hence be statistically significant, but such precision in estimating small effects does not make them relevant. The purpose of the paper is subsumed in the following sentence: “Having shown that culture as defined by religion and ethnicity affects beliefs about trust, we now want to show that these beliefs have an impact on economic outcomes”. The empirical results of the paper do not seem to clearly point to this conclusion. This is not to affirm that “culture has no effect on economic outcomes”: it is very likely that such effects exist. The problem is how to measure these effects, as defined by the Authors, with the available data. The correct reading of the empirical results of the paper is that, under the definition and with the data of the paper, such effects are measured as substantially negligible. Since we can a priori argue for the existence of such effects, the point is now to find, if possible, a proper empirical measure/definition of “culture” and proper data such that those effects can actually be estimated. Obviously, it could be the case that, with a proper definition of “culture”, a simple measuring of its “effects” based on a linear regression model shall be seen as inappropriate. Culture is a very rich construct; it is likely difficult to reduce it, in a meaningful way, to one or more qualitative or cardinal variates. Even when this is possible, why should its “effects” be expressed and measurable as monotonic, even linear, contributions to the conditional expectation of some variable? Any analysis of “cultural effects on economics” like the one contained in the quoted paper should begin by suggesting a solution to these practical modeling problems, on pain of irrelevance.
In this particular case, in fact, a correct reading of the results strongly suggests the irrelevance not of culture for economics but of this way of measuring it.

9.12.16 Concluding summary

As an overall comment to these examples, I would like to stress the need for any user of economic or financial research, and more in general of empirical research, to provide him/herself with the basic tools for “filtering out” excusable rhetoric noise from content when reading other people’s papers. It is absolutely understandable, if maybe a little scary from the point of view of a layman’s view of science, for Authors to “sell” a paper and try to put their results in the best possible light. This is true in all fields. However, empirical research has a role and a consequence only when both researchers and readers “share the code” which allows them to separate effective content from (admissible) advertising. The knowledgeable reader shall understand that, when we consider the selection, in the universe of papers, of those actually published in main journals, the hypothesis that “paper salesmanship” counts in having a paper accepted implies that many examples similar to those summarized here shall be found in main journals. Clearly, it is hopefully far less likely that interesting results shall be rejected for lack of salesmanship. In fact, it should be easy to sell really interesting results: these should “sell themselves”. Such a selection effect must be taken into account while reviewing any strain of empirical literature. There exists a subfield of Statistics, called “meta analysis”, that deals with these problems. Interestingly, meta analytic studies of the literature are quite frequent in medicine and biology but not, until recently, in Economics or Finance.

Examples: Exercise 9-Linear Regression.xls

10 Style analysis

Style analysis is interesting both from the point of view of practitioner’s finance and as an application of the linear regression model.
The current version of the model was elaborated by William F. Sharpe in a series of papers beginning in 1989. In this summary we shall refer to the 1992 paper (as of November 2018 you may download it at http://www.stanford.edu/∼wfsharpe/art/sa/sa.htm). In order to understand the origin of the model we must recall the intense debate which developed during the eighties about the validity of the CAPM, its possible substitution with a multifactor model, and the evaluation of the performance of fund managers. In a nutshell (we come back to this in some more detail in the next chapter): a factor model is a tool for connecting expected returns of securities, or of security portfolios, to the exposure of these securities to non diversifiable risk factors. The CAPM asserts that a single risk factor, the “market” or, better, the random change in the “wealth” of all agents invested in the market, is priced in terms of a (possible) excess expected return. This factor is empirically represented by the market portfolio, that is: the sum of all traded securities. The expected return of a security in excess of the risk free rate (remember that we are considering single period models) is proportional to the amount of the correlation between the security and the market factor. The proportionality factor is the same for all securities and is called the price of risk. Multifactor models, such as the APT, suggest the existence of multiple risk factors (not necessarily traded) with different prices of risk, so that the cross section of expected security (or security portfolio) excess returns is “explained” by the set of the security exposures to each factor. Classical implementations of the APT were based on economic factors: some were tradable, like the slope of the term structure of interest rates; some, at least at the time, non tradable, such as GNP growth and inflation.
At the turn of the nineties Fama and French, followed by others, produced a number of papers where factors were represented by spread portfolios. The most frequently used factors were based on the price to book value ratio, on the size of the firm and on some measure of market “momentum” (relative recent gain or loss of the stock w.r.t. the market). These factors were represented, in empirical analysis, by spread portfolios. As an instance: the price to book value factor was represented by the p&l of a portfolio invested, at time zero, in a zero net value position long in a set of high price to book value stocks and short in a set of low price to book value stocks. Fama and French asserted that the betas w.r.t. this kind of factor mimicking portfolios were “priced by the market”, that is, the correlation of a stock return with such portfolios implied a non null risk premium.

Consider now the problem of evaluating the performance of a fund manager. A preliminary problem is to understand for which reasons you, the fund subscriber, should pay the fund manager. Obviously, you should not pay the fund manager beyond implementation costs (administrative, market transactions etc.) for any strategy which is known to you at the moment you subscribe to (or do not withdraw from) the fund, if this strategy gives “normal” returns and if you can implement it by yourself. Suppose, for instance, that the asset allocation of the fund manager is known to you before subscribing to the fund. Since the subscription of the fund is your choice, the fund manager should not be paid for the fund results due to asset allocation or, better, should not be paid for this beyond implementation costs. A bigger fee could be justified only if, by the implementation of management decisions you cannot forecast on the basis of what you know, the fund manager earns some “non normal” return.
This is the reason why index funds should (and, in markets populated by knowledgeable investors, usually do) ask for small management fees. What we say here is that the same should hold for any fund managed with some algorithm replicable on the basis of a style model: for instance, funds which follow asset selection procedures based on variants of the Fama and French approach (that is: stock picking based on observable characteristics of the firms issuing the equity such as, for instance, accounting ratios, momentum etc.). While implementing such models requires some care and a lot of good data management, the reader should be aware of the fact that nothing magic or secret is required for their implementation. The fund manager’s contribution, with a possible value for you, if any, should be something you cannot replicate, that is: either something arising from abilities or information of the manager unavailable to you or, maybe, from some monopolistic or oligopolistic situation involving the manager. Let us suppose (a very naive idea!) that the second hypothesis is not relevant. A formal way to say that the manager’s ability is not available to you is to say that you cannot replicate its contribution to the fund return with a strategy conceived on the basis of your knowledge. Notice that for this reasoning to be valid it is not required that you actually perform any analysis of the fund strategy before buying it. Perhaps we could agree on the fact that you should perform such an analysis before buying anything. A mystery of finance is that people spend a lot of money in order to buy something whose properties are unknown to them. People wouldn’t behave in this way when buying, say, a car or even a sandwich. However, any lack of analysis simply means that something more, unexpected by you, shall become (in your opinion) merit or fault of the fund manager.
It is important to understand that, according to this view, the evaluation of the performance of a fund manager is, first of all, subjective. It is the addition of hypotheses on the set of information used by subscribers, and on their willingness to optimize using this information, that can convert the subjective evaluation into an economic model. The problem here is, obviously, to define what we mean by “normal return” and “known strategy”. Here a market model, representing the efficient financial use of public information, could be the sensible solution. Were the market model and the fund manager’s effective asset allocation available, the first could be used to define the efficiency of the second and, by difference, possible over or under performances on the part of the fund manager unexpected by the model. Alas, for reasons that shall be discussed in the following sections, satisfactory empirical versions of market models still have to appear or, at least, versions of market models, and statistical estimates of the relative parameters, strong enough to be agreed upon by everybody and so useful in an inter-subjective performance analysis. A less ambitious and more empirically oriented alternative is return based style analysis. This alternative yields a (model dependent) subjective statement about the quality of the fund. We shall return to this point, but we stress the fact that, if the purpose of the method is for a potential subscriber, or for someone already invested in the fund, to judge the fund manager’s performance, and not for some agency to award prizes, the subjective component of the method is by no means a drawback. Return based style analysis can be seen as a specific choice of “normal return” and “known strategy” definitions.
The “known strategy” is the investment in a set of tradable assets (typically total return indexes) according to a constant relative proportion strategy; the “normal return” is the out of sample return of this strategy, previously tuned in order to replicate the historical returns of the fund. This point has to be hammered in, so we repeat: the strategy is not chosen in order to yield “optimal” returns (in any case the lack of a market model would impede this) but only in order to replicate as well as possible, in the least squares sense, the returns of the fund strategy. In order to estimate the replica weights, the returns R^Π_t of the fund under investigation are fitted to a constant relative proportion strategy with weights β_j invested in a set of k predetermined indexes with returns R_jt:

R^Π_t = Σ_{j=1,...,k} β_j R_jt + ε_t

The term “constant relative weights strategy” indicates, as usual, a strategy where the proportion of wealth invested in any given index is kept constant over time. This implies that, when some index over performs the other indexes, a part of the investment in the over performing index must be liquidated and invested in the under performing indexes. For the sake of comparison, other possible strategies could be the buy and hold strategy, where a constant number of shares is kept for each index, and the trend following strategy, where shares of “loser” indexes are sold to buy shares of “winner” indexes. Both these strategies have variable weights on returns and could reasonably be used as reference strategies. There exist variants of the constant relative proportion strategy itself. In a constrained version the weights could be required to be non negative (short positions are not allowed). In another version weights could be allowed to change over time (in this case we should assume that the sum of all weights is constant over time). In typical implementations no intercept is in the model and the sum of the betas is constrained to be one.
The constant is dropped because it is usually interpreted as a constant return and, over more than one period, a constant return cannot be achieved even from a risk free investment. The assumption that the sum of all weights is one is required for the interpretation of the weights as relative exposures and, in the case of a multi period strategy, in order for the portfolio to be self financing. While both interpretations and both constraints could be challenged, in our applications we shall stick to the common use. We only relate the fact that, sometimes, instead of imposing the “sum to one” constraint explicitly at estimation time (footnote 57), this is implemented on an a posteriori basis by renormalizing the estimated coefficients. The two methods do not yield the same results. A relevant point in the choice of the reference strategy is that it should not cost too much. In this sense the constant relative proportion strategy could be amenable to criticism, as it can imply non negligible transaction costs. The reason for its use in style analysis seems to lean more on tradition than on suitability. Notice that in no instance are we supposing that the fund under analysis actually follows a constant relative proportion strategy invested in the provided set of indexes. We are NOT trying to discover the true investment of the fund but only to replicate its returns as best as we can with some simple model. This point has to be underlined because, at least in the first paper on the topic, Sharpe himself seems to state that the purpose of the analysis is to find the actual composition of the fund. This is obviously impossible if it is not the case that the fund is invested, with a constant relative proportion strategy, in the indexes used in the analysis. In fact, the actual discovery of the composition of the fund and of its evolution over time would hardly add anything to the purpose of identifying the part of the fund’s strategy not forecastable by the fund subscriber. A model would still be needed in order to divide what is forecastable from what is unforecastable in the fund evolution.

Footnote 57. The Σ_j β_j = 1 constraint can be imposed on the OLS model in a very simple way. First choose any R_jt series, say R_1t; typically the choice falls on some series representing returns from a short term bond, but any choice will do. Second, compute R̃_t = R_t − R_1t and R̃_jt = R_jt − R_1t for j = 2, ..., k. Now regress R̃_t on the R̃_jt for j = 2, ..., k. After running the regression, the coefficient for R_1t, which you do not directly estimate, shall be equal to 1 − Σ_{j=2}^k β_j.

Let us go back to the identity:

R^Π_t = Σ_{j=1,...,k} β_j R_jt + ε_t

Up to now this is not an estimable model but, as said above, an identity. In order to convert it into a model we must assume something about ε_t. A way of doing this is to recall the chapter on linear regression. The style model is clearly similar to a linear model; in particular, it is similar to a linear model where both the dependent and the independent variables are stochastic. In this case we know that a minimal hypothesis for the OLS estimate to work is that E(ε|R_I) = 0, where ε is the vector containing the observations on the n ε_t-s and R_I is the matrix containing the n observations on the returns of the k indexes. The second, less relevant, hypothesis is the usual E(εε'|R_I) = σ²_ε I_n. The hypothesis E(ε|R_I) = 0 has a sensible financial meaning: we are supposing that any error in our replication of the fund’s returns is uncorrelated with the returns of the indexes used in our replication.
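The “sum to one” trick just described takes only a few lines to implement. The sketch below uses simulated returns (all numbers are made up for illustration) and plain numpy least squares:

```python
# A minimal sketch of the "sum to one" differencing trick; the index
# returns, true weights and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, k = 120, 4                                  # n observations, k indexes
R_idx = rng.normal(0.005, 0.03, size=(n, k))   # hypothetical index returns
true_b = np.array([0.2, 0.3, 0.4, 0.1])        # weights summing to one
R_fund = R_idx @ true_b + rng.normal(0, 0.005, n)

# Regress (R_fund - R_1) on (R_j - R_1), j = 2..k, without intercept.
y = R_fund - R_idx[:, 0]
X = R_idx[:, 1:] - R_idx[:, [0]]
b_rest, *_ = np.linalg.lstsq(X, y, rcond=None)
b1 = 1.0 - b_rest.sum()                        # implied coefficient of index 1
betas = np.concatenate(([b1], b_rest))
print(betas.round(2), betas.sum())             # close to true_b, summing to one
```

Note that the constraint holds exactly by construction: the first coefficient is defined as one minus the sum of the estimated ones.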
Sharpe’s suggestion for the use of the model in fund performance evaluation is as follows: given a set of observations (typically with a weekly or lower frequency; Sharpe uses monthly data) from time t = 1 to time t = n, fit the style model from t = 1 to t = m < n and use the estimated coefficients for forecasting R_{m+1}; then add to the estimation set the observation m + 1 (and, in most implementations, drop observation 1), forecast R_{m+2}, and so on. These forecasts represent the fund’s performance as due to its “style”, where the term “style” indicates our replicating model. The important point is that this “style” result is forecastable and, in principle, replicable by us. The possible contribution of the fund manager, at least with respect to our replication strategy, must be found in the forecast error. The quality of the fund manager has to be evaluated only on the basis of this error. There are three possibilities:

• The fund manager’s return is similar (in some sense to be defined) to the replicating portfolio return. In this case, since you are able to replicate the result of the fund manager’s strategy using a “dumb” strategy, you shall be willing to pay the fund manager only as much as the strategy costs.

• The fund manager’s returns are less than your replica returns. In this case you should avoid the fund, as it can be beaten even by a dumb strategy which is not even conceived to be optimal but only to replicate the fund returns. This is a strong negative result. While it is true that it is possible to find alternative assets that, when calibrated to the fund returns in a style analysis, give a positive view of the same manager’s results, the fact that a simple strategy exists that beats the fund returns is enough to call into question any fund manager’s ability.

• The fund manager’s returns are better than your replica strategy. In this case it seems that the manager adds to the fund strategy something which you cannot replicate.
This is a hint in favor of the fund manager’s ability. It is a weak hint, for the same reason the negative result is a strong one. The negative result is strong because a simple strategy beats the fund manager’s one; the positive result is weak because the fund manager beats a simple strategy, but others could exist which equal or even beat the fund manager’s strategy. In any case this is at least a necessary condition for paying a fee greater than the simple strategy costs. The important point to remember, here, is that the result is relative to the strategy and the asset classes used. No attempt is made to build optimal portfolios with the given asset classes; only replica portfolios are built. The reader should think about the possible extensions of procedures like style analysis, were a market model available. A simple example of style analysis using my version of Sharpe’s data and three well known US funds is in the worksheet style analysis.xls.

10.1 Traditional approaches with some connection to style analysis

The idea that you should find some “normal” return with which to compare a fund return, and that this definition of “normal” return is to be connected with the return of some “simple strategy” related to the fund’s strategy, is so basic that many empirical attitudes are informally justified by it. On a first level, we observe very rough fund classifications into “families” of funds, defined by broad asset classes. This suggests that comparisons of funds be made only inside the same family. In a sense, the comparison strategy is implicitly considered as a mean of the strategies in the same asset class. Another shadow of this can be found in the frequently stressed idea that the result of any fund management must be divided between asset allocation and stock picking.
In common language this partition is not well defined, and asset allocation may mean many different things: for instance, the choice of the market, the choice of some sector, the choice of some index. Moreover there is no precise definition of how to distinguish between asset allocation and stock picking. But it is clear that this distinction, again, hints at some normal return, derived from asset allocation, and some residual: stock picking. The “benchmarking” idea is another crude version of the same: you try to separate the fund manager’s ability from the overall market performance by devising a benchmark which should summarize the market part of the fund manager’s strategy. Market models can be seen as a step up the ladder. Here the benchmark idea is expressed in a less naive way. Under the hypothesis that the market model holds and is known, and that the beta (CAPM) or betas (APT) of the fund are known, the part of the result due to the market factor(s) is to be ascribed to the overall strategic positioning of the fund and, as such, its consequences are in principle a choice of the investor. Any other over or under performance can be ascribed to the fund manager’s abilities and private information. As we mentioned above, this use of market models is greatly hampered by the fact that the proposition “...the market model holds and is known and the beta (CAPM) or betas (APT) of the fund are known” simply does not hold. Now a few words on comparison criteria. The classical Sharpe ratio considers the ratio of a portfolio’s return in excess of a risk free rate to its standard deviation. Even in this form the Sharpe ratio is a relative index: the fund performance is compared to a riskless investment. In general this comparison is not a useful one. Typically our interest shall be to compare the fund performance with a specific strategy which, in some instances, could be the best possible replication of the fund’s returns accomplished using information available to the investor.
In many cases this reference strategy shall be a passive strategy (this does not mean that the strategy is a buy and hold strategy, but that the strategy can be performed by a computer following a predefined program). As considered before, such a strategy could be provided, for instance, by some asset pricing model (CAPM, APT etc.). In other cases the reference strategy could simply be represented by the choice of a benchmark, used either in the unsophisticated way where, implicitly, a beta of one is supposed (that is, at the numerator of the Sharpe ratio take the difference between the returns of the fund and those of the benchmark), or in the more sophisticated way of computing the alpha of a regression between the return of the fund and the return of the benchmark. Otherwise the reference strategy could be based on an ad hoc analysis of the history of the fund under investigation. Style analysis is a way to implement this analysis. Two relevant final points. First: the comparison strategy should always be a choice of the investor. It is rather easy, from the fund’s point of view, to choose as comparison a strategy or a benchmark with respect to which the strategy of the fund is superior, at least in terms of alpha. This is known as “Roll’s critique”. While the fact that the strategy chosen by the investor as comparison is dominated by the fund strategy is admissible, as, usually, the fund does not tune its strategy to this or that subscriber’s comparison strategy (at least this is true if the subscriber is not big!), when it is the fund that chooses the comparison strategy a conflict of interests is almost certain. Second: once the part of the strategy due to the fund manager’s intervention has been identified, a summary of this based on the Sharpe ratio or on Jensen’s alpha is only one of the possible choices and strongly depends on the subscriber’s opinion on what is a proper measure of risk and return.
10.2 Critiques of style analysis

Under the hypotheses and the interpretation described in the previous section, style analysis can be considered a useful performance evaluation tool. However, at least in the version suggested by Sharpe, it lends itself to some strong critiques. A first, very simple, critique concerns the choice of the replicating strategy. While the use of indexes does not create big problems, at least when these indexes can be reproduced with some actual trading strategy, a big puzzle lies in the choice of a constant relative proportion strategy. This is both an unlikely and a costly strategy, due to portfolio rebalancing. The typical simple strategy is the buy and hold strategy: most indexes are, in principle, buy and hold strategies, and the market portfolio of the CAPM is a buy and hold strategy. As seen in chapter 1, the buy and hold strategy is NOT a constant relative proportion strategy. Moreover, a buy and hold strategy typically implies very small costs (the reinvestment of dividends is the main source of costs if there is no inflow or outflow of capital from the fund), while a constant relative proportion strategy implies a frequent rebalancing of the portfolio. Now, the replicating strategy is a free choice of the analyzer; however, if we simply suppose that the fund follows a buy and hold strategy in the same indexes used by the style analyzer, we end up with a strange, if perfectly natural, result. Obviously the R² of the model shall not be 1, except in the case of identical returns for all the indexes involved in the strategy; moreover, the analysis shall point out as “unforecastable”, and so due to the fund manager’s action, any return of the fund due to the lack of rebalancing implied by a buy and hold strategy. Suppose, for instance, that some index, during the analysis period, frequently outperforms (or under performs) the rest of the indexes used in the analysis.
This shall result in a forecast error, for the strategy fitted using a constant relative proportion strategy, which shall attribute to the fund manager a positive contribution to the fund result. On the contrary, temporary deviations of the return of one index from the returns of the others shall turn, in the comparison of the strategies, in favor of the constant proportion strategy (footnote 58).

Footnote 58. In the case of a positive trend of, say, an index with respect to the rest of the portfolio, a buy and hold strategy does not rebalance by selling some of the same index and buying the rest of the portfolio. In case of a further over performance of the index, the buy and hold portfolio shall over perform the rebalanced portfolio. In the case of a negative trend of some index with respect to the rest of the portfolio, the constant proportion strategy must buy some of the under performing index, selling some of the rest of the portfolio; if the under performance continues, this shall imply an over performance of the buy and hold strategy with respect to the constant relative proportion strategy. On the contrary, a strategy investing in temporary losers (after the loss!) or disinvesting in temporary winners shall outperform a buy and hold strategy in an oscillating market.

A second critique, of theoretical interest but hardly relevant in practice, is connected with Roll’s critiques of CAPM tests and, more in general, of CAPM based performance evaluation. If the constant proportion strategy does not contain all the indexes required for composing an efficient portfolio, any investment by the fund manager in the relevant excluded indexes shall result in an over performance. This would be relevant only if the evaluated fund manager knew, ex ante, the style model with which his/her strategy shall be evaluated AND if the fund manager had more thorough information on the structure of the efficient portfolio.
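Going back to the first critique, the trending index effect can be seen in a toy computation with made-up numbers: when one index persistently outperforms, buy and hold beats the constant relative proportion strategy.

```python
# A toy illustration, under made-up numbers, of the trending index point:
# asset 1 gains 5% every period, asset 2 is flat, initial weights 50/50.
import numpy as np

r = np.array([[0.05, 0.00]] * 10)              # 10 periods of returns
w0 = np.array([0.5, 0.5])

# Buy and hold: each initial stake compounds separately, no rebalancing.
bh_wealth = (w0 * np.prod(1 + r, axis=0)).sum()

# Constant relative proportions: rebalance back to w0 every period, so
# each period's portfolio return is the weighted average of returns.
cp_wealth = np.prod(1 + r @ w0)

print(bh_wealth, cp_wealth)                    # buy and hold ends ahead
```

In an oscillating market the ranking reverses, as the footnote above explains: rebalancing buys the temporary losers before they recover.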
The point is that, while it is rather easy to compute an efficient portfolio ex post, this is not so easy ex ante. Moreover, if we accept the idea that the style decomposition depends on the information of the analyzer, this critique loses much of its strength. A third, and more subtle, critique can be raised against style analysis, as well as against any OLS based factor model used for performance evaluation. If the model is fitted to the fund returns, the variance (or sum of squares, if no intercept is used) of the replicating strategy shall always be less than or equal to that of the fund returns. In a CAPM or APT logic this is not a problem, since only non diversifiable risk should be priced in the market. However, as stressed above, we are NOT in a CAPM or APT world. With this lack of variance we are giving a possible advantage to the fund. Ways of correcting this problem can be suggested and, in fact, performance indexes which take this problem into account do exist. However, since, as we saw above, the positive (for the fund) result is already a weak result in style analysis, this undervaluation of the variance is only another step in the same direction: negative valuations are strong, neutral or positive valuations could be challenged. A last word of warning. Many data providers and financial consulting firms sell style analysis. As far as I know, the advertising of commercial style models invariably asserts the ability of such models to discover the true composition of the fund portfolio, and most reports produced by style analysis programs concentrate on the time evolution (estimated by some rolling window OLS regression) of portfolio compositions. This is quite misleading (Sharpe is somewhat responsible, as in the original papers he seems to share this opinion) and can be accepted only if interpreted as a (misleading) way to assess the true purpose of the strategy, that is: return replication.
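The rolling window OLS just mentioned, which is also the scheme behind the out of sample forecasts of the previous section, can be sketched as follows; the data are simulated and the window length is an illustrative assumption.

```python
# A rough sketch of a rolling window, out of sample style forecast:
# fit on a window of m observations, forecast the next one, roll forward.
# Returns, weights and the window length are made-up illustrations; for
# brevity the fit is unconstrained OLS (no "sum to one" constraint).
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 100, 3, 60                       # total obs, indexes, window
R_idx = rng.normal(0.004, 0.02, (n, k))    # hypothetical index returns
w_true = np.array([0.5, 0.3, 0.2])
R_fund = R_idx @ w_true + rng.normal(0, 0.004, n)

errors = []                                # out of sample forecast errors
for t in range(m, n):
    Xw, yw = R_idx[t - m:t], R_fund[t - m:t]
    b, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    errors.append(R_fund[t] - R_idx[t] @ b)
print(f"mean out of sample error: {np.mean(errors):.5f}")
```

In the performance evaluation reading of the previous section, the sequence of these errors, not the fitted weights, is the object of interest.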
As far as I know, the typical seller and user of style analysis, if not warned, tends to believe the “fund composition” story. This false idea usually disappears after some debate, provided, at least, that the user or seller is even marginally literate in simple quantitative methods.

Examples: Exercise 10-Style Analysis.xls

11 Factor models and principal components

11.1 A very short introduction to linear asset pricing models

11.1.1 What is a linear asset pricing model

Let us begin by considering the following plot. Here you find yearly excess total linear return averages and standard deviations for those stocks which were in the S&P 100 index from 2000 to 2019 (weekly data, 83 stocks). As you can see, stocks with similar average total returns show very different standard deviations, and vice versa. We know that the statistical error in the estimate of expected returns using average returns may be big; however, if we believe that the average return has anything to do with the expected return, and the standard deviation with risk, the plot is puzzling (and these are 83 BIG companies). We can see asset pricing models as tools devised to answer the kind of puzzles which plots like this one may raise. Among these are two of the oldest and most relevant questions of Finance:

1. in the market we see securities whose prices evolve in completely different ways. There may even be securities that have both mean returns lower and standard deviations of returns higher than other securities. Why are all these securities, with such apparently clashing statistical behaviours, still traded in equilibrium?

2. which are the “right” equilibrium relative prices of traded securities?

(Do not be puzzled by the fact that we speak of asset pricing models and we write returns. Given the price at time 0, the return between time 0 and time 1 determines the price at time 1.) We anticipate here the answers to these two questions given by asset pricing models:

1.
securities prices can be understood only when securities are considered within a portfolio. Completely different (in terms, say, of means and variances of returns) securities are traded because they contribute to improving the overall quality of the portfolio (in the classic mean variance setting this boils down to the usual diversification argument). What is relevant is not the total standard deviation of each security but how much of it cannot be diversified away in a big portfolio; for this reason the expected return of a security should not be compared with its standard deviation but only with the part of this standard deviation which cannot be diversified away;

2. the right (excess) expected returns of different securities should be proportional to the non diversifiable risks “contained” in the returns: to equal amounts of the same risk should correspond equal amounts of expected return.

These are not the only observed properties of asset prices/returns which asset pricing models try to account for. Another striking property is as follows: while thousands of securities are quoted, there seems to be a very high correlation, on average, among their returns. In a sense it is as if those many securities were “noisy” versions of much less numerous “underlying” securities. For instance, the 83 stocks of the S&P 100 displayed above show an average (simple) correlation (over 20 years!) of .31. If we recall the discussion connected with the spectral theorem and compute the eigenvalues of the covariance matrix of these returns, while we see no eigenvalue equal to 0, the sum of the first 5 eigenvalues is greater than 50% of the sum of all the eigenvalues, while the last, say, 50 eigenvalues account for about 15% of the total. The first eigenvalue alone is about 1/3 of the total. This suggests the idea that, while not singular, the overall covariance matrix can be well approximated by a singular covariance matrix.
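We cannot reproduce the S&P 100 data here, but the eigenvalue computation just described can be sketched on simulated one factor returns, where the same concentration of the spectrum appears:

```python
# A sketch of the covariance eigenvalue computation, on simulated
# one-factor returns (the factor, loadings and noise are made up; the
# real S&P 100 data used in the text are not reproduced here).
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 80
f = rng.normal(0, 0.02, n)                   # a single common factor
B = rng.uniform(0.5, 1.5, m)                 # factor loadings
r = np.outer(f, B) + rng.normal(0, 0.02, (n, m))

eig = np.sort(np.linalg.eigvalsh(np.cov(r, rowvar=False)))[::-1]
share_first = eig[0] / eig.sum()
print(f"first eigenvalue share: {share_first:.2f}")  # far above 1/m
```

With a single true factor the first eigenvalue alone carries a large share of the total variance, even though no eigenvalue is exactly zero: the same qualitative picture described above.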
It should be clear that answering these questions, and modelling the high average correlation of returns, is important for any asset manager; in fact, asset pricing models are central to any asset management style not purely based on gut feelings. We can deal with these problems within a simple class of asset pricing models known as “linear (risk) factor models”. Here we give some hints of how this is done in practice. An asset pricing model begins with a “market model”, that is, a model which describes asset returns (usually linear returns) as a function of “common factors” and “idiosyncratic noise”. These models are, most frequently, linear models, and a typical market model for the 1 × m vector of excess returns r_t observed at time t, the 1 × k vector f_t of “common risk factors” observed at time t and the 1 × m vector of errors ε_t is:

r_t = α + f_t B + ε_t

where B is a k × m matrix of “factor weights” and α is a 1 × m vector of constants. We suppose we observe the vectors r_t and f_t for n time periods. Stacking the n vectors of observations for r_t and f_t in the n × m matrix R and the n × k matrix F, and stacking the corresponding error vectors in the n × m matrix ε, we suppose: E(ε|F) = 0, V(ε_t|F) = Ω and E(ε_t ε_{t'}'|F) = 0 for all t ≠ t'. In order to give meaning to the term “idiosyncratic”, the contemporaneous covariance matrix Ω is, as a rule, supposed to be diagonal, typically with non equal variances. It is relevant to stress the fact that such a time series model can be a good explanation of the data in R (for instance it may show a high R² for each return series) and, at the same time, no asset pricing model could be valid.
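A minimal sketch of fitting the market model above by OLS on simulated data (all sizes and numbers are illustrative). It also checks a standard OLS fact: with an intercept in the regression, the estimated intercepts equal the average returns minus the average factors times the estimated weights.

```python
# A sketch, on simulated data, of the market model r_t = a + f_t B + e_t
# estimated by OLS for all securities jointly; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 200, 5, 2
F = rng.normal(0.0, 0.02, (n, k))          # n observations of k factors
B = rng.normal(size=(k, m))                # true factor weights
alpha = rng.normal(0.0, 0.002, m)          # true intercepts
R = alpha + F @ B + rng.normal(0.0, 0.01, (n, m))

# OLS for all m securities at once: regress R on a constant and F.
X = np.column_stack([np.ones(n), F])
coef, *_ = np.linalg.lstsq(X, R, rcond=None)
alpha_hat, B_hat = coef[0], coef[1:]

# Standard OLS identity: intercept = mean return - mean factors @ B_hat.
check = R.mean(axis=0) - F.mean(axis=0) @ B_hat
print(np.allclose(alpha_hat, check))       # True
```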
Let us recall that, if we estimate the market model with OLS (this may be done security by security or even jointly), the OLS estimate of α can be written in a compact way as

α̂ = r̄ − f̄ B̂

where r̄ is the 1 × m vector of average excess returns (one for each security, averaged over time), f̄ is the 1 × k vector of average common factor values (again: averaged over time) and B̂ is the k × m matrix of OLS estimated factor weights (one for each factor for each security). The expected value of this, under the above hypotheses, is:

E(α̂) = E(r) − E(f)B

As we shall see in a moment, an asset pricing model is valid if, supposing Ω diagonal, we have that α = 0. This is usually written as:

E(r) = λB

where λ = E(f) is a 1 × k vector of “prices of risk”, and in a moment we shall see why this name is used. It is now important to stress that this restriction may hold, so that the asset pricing model is valid, even if the time series model offers a very poor fit of r; or, on the contrary, the fit could be very good and α ≠ 0 (footnote 59). For asset management purposes, however, a good fit of the time series model with k << m could be very useful even when the asset pricing model does not hold.

Footnote 59. Beware: what we have just described as a possible test of an asset pricing model is useful for understanding the loose interplay between the time series model and the asset pricing model, but it is, typically, not a very efficient way, from the statistical point of view, to test the validity of an asset pricing model.

Suppose, for instance, you want to use a Markowitz model for your asset allocation. In order to do this you need to estimate the variance covariance matrix of returns. This requires the estimate of m(m + 1)/2 unknown parameters using n observations on returns. With a moderately big m this could be a hopeless task.
Suppose now the market model works, at least in the time series sense, meaning that the R² of each of the m linear models is big. In this case the variances of the errors are small and:

V(r_t) = B'V(f_t)B + Ω ≅ B'V(f_t)B

Let us now count the parameters we need to estimate the varcov matrix of the excess returns with and without the market model. Without the market model, the estimation of V(r_t) requires the estimation of m(m + 1)/2 parameters, while with the factor model it requires the estimation of k × m + k × (k + 1)/2 parameters, that is, B and V(f_t). Suppose for instance m = 500 and k = 10: the direct estimation of V(r_t) implies the estimation of 125250 parameters, while the (approximate) estimate based on the factor model implies “only” 5000 + 55 parameters. The reader should notice that, even if the above assumptions on V(r_t) are right, the use of B'V(f_t)B in place of the full covariance matrix shall imply an underestimation of the variance of each asset return, which is going to be negligible only if all the R² are big. Let us move on a step. We must remember that our aim is the construction of portfolios of securities with weights w and excess returns r_t w. In this case we are not necessarily interested in the full V(r_t) but in the variance of the portfolio

V(r_t w) = w'B'V(f_t)Bw + w'Ωw

It is well possible that w'Ωw be small, so that w'B'V(f_t)Bw be a good approximation of V(r_t w), even if it is not true that all the R² are big and, as a consequence, the diagonal elements of Ω small. Suppose that the weights w of the different securities in the portfolio are all of the order of 1/m. This simply means that no single security dominates the portfolio. We have, then,

w'Ωw = Σ_{i=1}^m w_i² ω_i ≈ (1/m) Σ_{i=1}^m ω_i/m

and this, with bounded, but not necessarily small, diagonal elements ω_i of Ω, goes to 0 as m goes to infinity.
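The parameter count in the m = 500, k = 10 example above can be checked in two lines:

```python
# Parameter count with and without the factor model, for the m = 500,
# k = 10 example in the text.
m, k = 500, 10
full_covariance = m * (m + 1) // 2          # free parameters in V(r_t)
factor_model = k * m + k * (k + 1) // 2     # B plus V(f_t)
print(full_covariance, factor_model)        # 125250 5055
```

The factor structure reduces the problem by more than a factor of twenty, which is the whole point of the approximation for estimation purposes.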
This means that, for large, well diversified, portfolios, “forgetting” Ω is irrelevant even if its diagonal elements are not small. The hypothesis of a diagonal Ω, that is, of idiosyncratic ε_t, is fundamental for this result. From this result we can shed some light on the reason why we should have E(r) = E(f)B = λB, that is: why an asset pricing model should hold. In order to understand this, it is enough to compute the expected value and the variance of our well diversified portfolio (notice the approximation sign for the variance):

E(r_t w) = E(r_t)w = αw + E(f_t)Bw

V(r_t w) ≅ w'B'V(f_t)Bw

Suppose now α ≠ 0. Recall that B is a k × m matrix with (supposedly) k << m, and we can always suppose that the rank of B is k (if this is not the case we can reduce the number of factors). This implies that B'V(f_t)B is an m × m matrix of rank k < m. The matrix B'V(f_t)B is then SEMI positive definite; this implies that there exist m − k orthogonal vectors z such that z'z = 1 and z'B'V(f_t)Bz = 0. According to what was discussed in the matrix algebra section and in the presentation of the spectral theorem, under conditions we do not specify here, we can always build from these a set of weights w_$ such that w_$'1 = 1 and αw_$ > 0. You should understand the reason for the dollar sign. The vector w_$ defines a zero risk portfolio (zero variance) with positive excess return αw_$ (since the variance is zero, the expected excess return becomes the excess return). In other words, we have created a risk free security (the portfolio) which yields a return (arbitrarily) greater than the risk free rate. This is an “arbitrage”, as one could borrow any amount of money at the risk free rate and invest it in the portfolio with a positive profit and no risk (hence the $). Provided all the financial operations involved (building the portfolio, borrowing money etc.)
are possible, this should not happen if traders are "reasonable" (and if they know of the existence of the factor model). The only way to avoid this unconditionally (that is, whatever the choice of $w_\$$) is that $\alpha = 0$, so that

$$E(r) = E(f)B = \lambda B$$

Let us now give a "financial interpretation" of this result. Each element $\beta_{ji}$ of $B$ represents the "amount" of non diversifiable factor $f_j$ in the excess return of security $i$, and $E(f_j)$ represents the expected excess return of a security which has a "beta 1" with respect to the $j$-th factor and zero with respect to the others (if the factor $f_j$ is the excess return of a security, this could simply be the expected excess return of that security, but this is not required). We may then understand the name "price of risk for factor $j$" used for $E(f_j) = \lambda_j$ and the name "risk premium for factor $j$" given to the "price times quantity" product $\lambda_j \beta_{ji}$.

Now that we have a rough idea of how an asset pricing model works, it could be useful to go back to the questions with which this section began and think a little about how the answers follow from the asset pricing model. We should first notice that the approximation

$$V(r_t w) = w'B'V(f_t)Bw + w'\Omega w \cong w'B'V(f_t)Bw$$

is a formal interpretation of the empirical fact that correlations among quoted securities returns are on average high. The interpretation is based on the idea that all returns depend (in different ways) on the same underlying factors, and that what is "not factor" is uncorrelated across returns. For this reason, well diversified portfolios of securities tend to show returns whose variance only depends on that of the factors. As a consequence, it shall be difficult to build many such well diversified portfolios which are not correlated among themselves. In fact, if we assume the above approximation to be exact, and $V(f_t)$ to be non singular, exactly $k$ of such portfolios can be built, the choice being unique modulo an orthogonal transform.
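The zero variance "arbitrage" construction described above can be illustrated numerically. Everything here is made up for illustration ($B$, $V(f)$ and $\alpha$ are random draws), and the construction assumes the generic case in which, after projecting onto the null space of $B'V(f)B$, $\alpha$ is not parallel to the vector of ones:

```python
import numpy as np

rng = np.random.default_rng(8)
m, k = 6, 2
B = rng.normal(size=(k, m))            # loadings, rank k (generic random matrix)
Vf = np.diag([0.04, 0.02])             # factor covariance V(f)
alpha = 0.01 * rng.normal(size=m)      # mispricing vector, alpha != 0
S = B.T @ Vf @ B                       # B'V(f)B: m x m, rank k, positive SEMI definite

# Zero-variance directions: eigenvectors of S with eigenvalue 0 (m - k of them).
lam, X = np.linalg.eigh(S)
Z = X[:, np.abs(lam) < 1e-10]          # orthonormal basis of the null space

# Combine them into weights w with w'1 = 1 and alpha w > 0.
c, a = Z.T @ np.ones(m), Z.T @ alpha
v0 = c / (c @ c)                       # satisfies c'v0 = 1
d = a - (a @ c) / (c @ c) * c          # c'd = 0: adding t*d keeps the budget w'1 = 1
t = (1 + abs(a @ v0)) / (a @ d)        # generic case: a @ d > 0
w = Z @ (v0 + t * d)

print(w.sum())           # ~ 1: fully invested
print(w @ S @ w)         # ~ 0: zero variance
print(alpha @ w > 0)     # True: positive excess return with no risk
```

The resulting weights are typically extreme long/short positions, which is exactly why such a mispricing should be traded away.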
It is then quite tempting to interpret any choice of such $k$ non correlated portfolios as a "factor estimate". Some aspects of this idea shall be reconsidered when presenting the "principal components" approach to risk factor estimation.

Asset pricing models give a very precise answer to the puzzle of why securities are traded in the market even when they show, at the same time, lower average returns and higher standard deviations than other traded securities. This is a possible equilibrium because what is relevant is not the "absolute risk" (marginal standard deviation of returns) of a security, but its contribution to the risk/return mix of a well diversified portfolio. For this reason, we can see a relatively low average return together with a high standard deviation simply because the security showing these statistics has little correlation with the systematic risk factors. The model tells us many other interesting things regarding this point. For instance: it tells us that if we see two securities with, say, the same average returns and very different return standard deviations, the correlation between the returns of these securities should be small. Last: asset pricing models give us formulas for measuring the "right mix" of expected returns and correlation with systematic risk factors (betas), and this answers the question about the right equilibrium relative prices. On this basis, asset pricing models give us a unified framework to precisely quantify and test the equilibrium price system and to transform the statistical results into asset management tools.

11.1.2 Tests of the CAPM

Empirical analysis of asset pricing models is of central importance for asset management. This is an introductory Econometrics course and it is not the place for a detailed analysis of how to test an asset pricing model. It can, however, be useful to give a quick idea of how this could be done in the case of the prototype of all asset pricing models: the CAPM.
The CAPM is a single common risk factor model where the risk factor is the excess return of a "market portfolio": $r_M$. According to the CAPM, the expected asset excess returns $E(r_i)$ are proportional to the asset betas, the proportionality constant being the expected market excess return:

$$E(r_i) = \beta_i E(r_M), \quad i = 1, \ldots, m$$

In the following we explain how linear regression can be used to test the CAPM. The kind of test described here is quite simple and naive, similar to the first empirical analyses of the CAPM at the end of the sixties. Much has been done in the following fifty years, but this is not a topic for this course.

We want to test whether $E(r_i) = \beta_i E(r_M)$, $i = 1, \ldots, m$. In this equation $\beta_i$ is the independent variable and $E(r_i)$ is the dependent variable: in fact the CAPM asserts that $E(r_i)$ is a linear function of $\beta_i$. Since $E(r_i)$ and $\beta_i$ are not observable, we must estimate them: $E(r_i)$ is estimated with the sample mean $\bar r_i$ and $\beta_i$ is estimated by OLS on the factor model, as in the previous paragraph. We then consider the regression equation which asserts that $\bar r_i$ is a linear function of $\hat\beta_i$ plus an error term (we need to insert an error term since we use estimates):

$$\bar r_i = \gamma_0 + \gamma_1 \hat\beta_i + \epsilon_i, \quad i = 1, \ldots, m$$

This is called the second-pass regression equation. It is a cross-sectional regression, unlike the time series regression of the factor model (in the factor model regression the observations refer to different times; here the observations refer to different assets). If the CAPM is valid, then $\gamma_0$ and $\gamma_1$ should satisfy $\gamma_0 = 0$ and $\gamma_1 = \bar r_M$, where $\bar r_M$ is the mean market excess return. In fact, however, you can go a step further and argue that the key property of the expected return-beta relationship of the CAPM is that the expected excess return is determined by beta only.
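The two-pass procedure just described can be sketched on simulated data. All numbers below are made up for illustration; since the CAPM holds by construction in the simulation, the second-pass coefficients should come out close to their theoretical values:

```python
import numpy as np

rng = np.random.default_rng(42)
T, m = 260, 50                        # sample length and number of assets (illustrative)
beta = rng.uniform(0.5, 1.5, size=m)  # "true" betas
rM = rng.normal(0.0015, 0.02, size=T)                      # market excess returns
r = rM[:, None] * beta + rng.normal(0, 0.03, size=(T, m))  # CAPM holds by construction

# First pass: time series OLS of each asset's excess return on the market.
X = np.column_stack([np.ones(T), rM])
coef, *_ = np.linalg.lstsq(X, r, rcond=None)
beta_hat = coef[1]

# Second pass: cross-sectional OLS of average excess returns on estimated betas.
rbar = r.mean(axis=0)
Z = np.column_stack([np.ones(m), beta_hat])
(g0, g1), *_ = np.linalg.lstsq(Z, rbar, rcond=None)
print(g0)               # should be close to 0
print(g1, rM.mean())    # g1 should be close to the mean market excess return
```

Even in this clean setting $g_1$ is slightly attenuated toward zero, because the regressor $\hat\beta_i$ carries sampling error: the errors-in-variables problem that motivates the grouping procedure discussed below.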
In particular, if the CAPM is valid, the expected excess return should be independent of non systematic risk, as measured by the variances of the residuals $\sigma_i^2$, which are also estimated in the factor model step. Furthermore, the dependence on beta should be linear. To verify both conclusions of the CAPM, you can consider the augmented regression model

$$\bar r_i = \gamma_0 + \gamma_1 \hat\beta_i + \gamma_2 \hat\beta_i^2 + \gamma_3 \hat\sigma_i^2 + \epsilon_i, \quad i = 1, \ldots, m$$

and test $\gamma_0 = 0$, $\gamma_1 = \bar r_M$, $\gamma_2 = 0$, $\gamma_3 = 0$.

There are several difficulties with the above procedure. First and foremost, stock returns are extremely volatile, which reduces the precision of any test. In light of this volatility, the security betas and expected returns are estimated with substantial sampling error. A possible improvement is that of grouping returns into portfolios instead of using them one by one. A classic procedure based on this idea begins by ordering the average returns by estimated beta value. The $m$ average returns are then grouped in, say, $q$ portfolios and, for each portfolio, the average beta of its components is computed. The second-pass regression is then run using as dependent variable the average return of each portfolio and as regressor its average beta. Averaging should decrease the sampling error implied by the market model regressions.

As written above, this course is not the place for a more detailed study of how to test asset pricing models. There is neither room nor opportunity to discuss the empirical successes and failures of asset pricing models (the theoretical and empirical literature is huge and the debate has raged on since the invention of the CAPM almost 60 years ago).

Just a quick glimpse at the S&P 100 data used above. Regress each excess return on the excess return of the market index and take the residuals of each of the 83 regressions. The question is whether the residuals of these regressions are "idiosyncratic".
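One convenient summary, used just below in the text, is the sum of the squared elements of the correlation matrix. A toy simulated check of how a single-factor regression reduces it (all numbers here are made up; the figure quoted in the text refers to the actual S&P 100 dataset):

```python
import numpy as np

def overall_correlation(R):
    # Sum of the squared elements of the correlation matrix of the columns of R.
    C = np.corrcoef(R, rowvar=False)
    return (C ** 2).sum()

rng = np.random.default_rng(1)
T, m = 500, 20
f = rng.normal(size=T)                        # common factor
B = rng.uniform(0.8, 1.2, size=m)             # loadings
R = np.outer(f, B) + rng.normal(size=(T, m))  # returns = factor part + idiosyncratic

beta_hat = (f @ R) / (f @ f)                  # no-intercept OLS of each column on f
resid = R - np.outer(f, beta_hat)

ratio = overall_correlation(resid) / overall_correlation(R)
print(round(ratio, 2))                        # well below 1: residuals far less correlated
```

The measure never falls below $m$ (the diagonal of the correlation matrix is all ones), so the ratio cannot reach zero even with perfectly idiosyncratic residuals.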
While the original excess returns are almost invariably positively correlated, the residuals show both positive and negative correlations. A convenient measure of "overall correlation", then, is the sum of the squared elements of the correlation matrix. If we take the ratio between this sum of squares for the residuals and for the original data, we get 0.23. The simple beta model, then, reduces the measure of overall correlation to less than one quarter of its original value. We deduce that, while other factors may be necessary, the single beta model takes us a long way toward the separation between systematic and idiosyncratic "risk".

Asset pricing models are central both for asset management and for Corporate Finance; for this reason they constitute a mainstay of Finance education. For the interested Reader a good starting point would be: Kenneth J. Singleton, "Empirical Dynamic Asset Pricing: Model Specification and Econometric Assessment", 2006, Princeton University Press.

11.2 Estimates for B and F

When the factors $F$ are observable variables, the matrix $B$ can be estimated using OLS (in fact a slightly better estimate exists, but it is outside the scope of these notes). This, in principle, is what we did with the style model, which could, with some indulgence, be considered as the "market model" part of an asset pricing model. In fact, for the style analysis method to work it is not strictly necessary that the style model correspond to a full market model. This is due to the fact that, in style analysis, the model is used as a reference benchmark only. The joint use of a benchmark model which is also a market model would, in any case, be in theory a more coherent choice. We also discussed this in the case of the CAPM. In the CAPM there exists a single common factor, represented by the wealth of agents, intended as everything that impacts agents' utility, as risked on the market.
This common factor cannot be directly observed and is proxied in practice by some market index; to it the model adds $m$ idiosyncratic factors, supposed to be uncorrelated with the common factor and among themselves. If we believe in the quality of the proxy for the wealth of agents, an OLS estimate shall work in this case too.

The typical asset pricing model uses as factors some CAPM-like index and observable macroeconomic variables. The Fama and French model is a CAPM plus two long/short portfolios for value stocks (low against high price to book value) and size stocks (later a momentum portfolio was added). A huge academic industry of "finding relevant risk factors" to "explain the cross section of stock returns" (recall the second-pass regression) arose from this, with hundreds of papers and suggested risk factors (see footnote 60). Current practitioner models, widely used in the asset management industry for asset allocation, risk management and budgeting, and performance evaluation, include,

Footnote 60: For those interested, read: Campbell R. Harvey, Yan Liu, Heqing Zhu, "... and the Cross-Section of Expected Returns", The Review of Financial Studies, Volume 29, Issue 1, January 2016, pages 5-68. In this very interesting and funny paper the Authors attempt a wide review of the risk factors suggested for market models in published papers up to 2015. They consider 313 papers and 316 different, but often correlated, factors. The Authors are very clear about the fact that this is, actually, not a complete review of the published and unpublished research on the topic. They summarize the results and stress the important statistical implications of the fact that, since the vast majority of these papers use data on the US stock market or on markets correlated with it, they are not based on independent experiments or observational data, but on what are, in essence, different time sections of the same dataset.
This is a classic case of the "data mining", "multiple testing" or "exhausted data" problem, sometimes also called "pretesting bias". In this case many, in general dependent, tests are run on the same dataset. Often, tests are chosen and run conditional on the results of other tests. This requires a very careful assessment of the joint P-value of the testing procedure, which cannot be reduced to a test by test analysis. The result of such an assessment is that individual tests should be run under increasingly stringent "significance" requirements as new hypotheses are tested in addition to old ones. This quickly makes it impossible to test new hypotheses on the same "exhausted" dataset. (End of footnote 60.)

in my experience, roughly from 10 to 15 risk factors, and are tuned to specific asset classes, so that they do not pretend to be general market models. All these models can, in principle, be dealt with by regression methods.

There is, however, a different attitude toward factor modeling. This attitude attempts a representation of underlying unobserved factors based on portfolios of securities which are not defined a priori but jointly estimated with the model by optimizing some "best fit" criterion. In order to do this, we need a joint estimation of $F$, the matrix of observations on all factors, and $B$, the factor weights matrix. A common starting point is that of requiring the factors $f_t$ to be linear combinations of excess returns: $f_t = r_t L$. In principle there exist infinitely many choices for $L$. A unique solution can be chosen only by imposing further constraints, and each choice of constraints identifies a different set of factors. Most frequently, factor models of this kind are based on the principal components method or on variants of it. The principal components method is a classic data reduction method of Multivariate Statistics which has received a lot of new interest with the growth of "big data".
In Finance, principal components have been in use at least since the nineteen sixties/seventies. We can describe the procedure of "factor extraction", that is, the unique identification/estimation of factors, in two different but equivalent ways. Both methods require, implicitly or explicitly, an a priori, maybe very rough, estimate of $V(r_t)$. For this to be possible, a fundamental assumption is that $V(r_t) = V(r)$, that is: the variance covariance matrix of excess total returns is time independent. When this is not assumed to hold, more complex methods than simple principal components are available, but these are well beyond the scope of these notes.

11.2.1 Principal components as factors

As a starting point, suppose that the variance covariance matrix $V(r)$ of a $1 \times m$ vector of returns $r$ is known. We introduce the principal components, at first, in an arbitrary way; in the following subsection we shall justify the choice. From the spectral theorem we know that $V(r) = X\Lambda X'$. By the rules of matrix product, and recalling that $\Lambda$ is diagonal, we have:

$$X\Lambda X' = \sum_i x_i x_i' \lambda_i$$

where $x_i$ is the $i$-th column of $X$ and the sum runs from 1 to $m$. Notice that, in general, only $k$ of the eigenvalues $\lambda_j$ are greater than 0, while the others are equal to 0; here $k$ is the rank of $V(r)$. For simplicity, in the following formulas we suppose $k = m$, but with proper changes of indexes the formulas are correct in general. Define the "principal components" as the "factors" (and remember: principal components are linear combinations of returns)

$$f_j = r x_j$$

and regress $r$ on $f_j$ (footnote 61).
These are $m$ univariate regressions and the "betas" (one for each return in $r$) of these regressions are, as usual (footnote 62):

$$\beta_j = \frac{Cov(f_j; r)}{V(f_j)} = \frac{E(x_j'r'r) - E(x_j'r')E(r)}{V(f_j)} = \frac{x_j'V(r)}{V(f_j)}$$

However:

$$x_j'V(r) = x_j'X\Lambda X' = x_j'\sum_i x_i x_i'\lambda_i = \lambda_j x_j'$$

and

$$V(f_j) = V(rx_j) = x_j'X\Lambda X'x_j = \lambda_j$$

so that $\beta_j = x_j'$. Let us now find $V(r - f_j\beta_j)$:

$$V(r - f_j\beta_j) = V(r - rx_jx_j') = [I - x_jx_j']V(r)[I - x_jx_j'] = [I - x_jx_j']X\Lambda X'[I - x_jx_j'] =$$
$$= [X\Lambda X' - \lambda_jx_jx_j'][I - x_jx_j'] = [X\Lambda X' - \lambda_jx_jx_j' - \lambda_jx_jx_j' + \lambda_jx_jx_j'] =$$
$$= X\Lambda X' - \lambda_jx_jx_j' = \sum_{i \neq j} x_ix_i'\lambda_i = X_{-j}\Lambda_{-j}X_{-j}'$$

where $X_{-j}$ and $\Lambda_{-j}$ are, respectively, the matrix $X$ without column $j$ and the matrix $\Lambda$ without row and column $j$. In other words, the covariance matrix of the "residuals" $r - f_j\beta_j$ has the same eigenvectors and eigenvalues as the original covariance matrix, with the exception of the eigenvector and eigenvalue involved in the computation of $f_j$ (footnote 63).

Footnote 61: Here the regression is to be intended as the best approximation of $r_i$ by means of a linear transformation of $f_j$. The intercept is included, see the next note.

Footnote 62: Notice that the definition of $\beta_j$ employed here implies the use of an intercept. We have not mentioned it, since we are interested in the variance covariance matrix of $r$, which is unaffected by the constant. In any case, the value of the constant $1 \times m$ vector $\alpha$ is $E(r) - E(f_j)\beta_j$.

Footnote 63: A fully matrix notation makes the derivations even simpler, if less understandable.

This result is due to the orthogonality of the factors (footnote 64) and has several interesting implications. We mention just three of these. First: one by one "factor extraction", that is, the computation of the $f$'s and of the corresponding residuals, yields the same results whether performed in batch or one by one. Second: the result is invariant to the order of computation. Third: once all factors are considered, the residual variance is 0. This last, obvious, result can be written in this way: if we set $F = rX$ we have $r = FX'$.
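Before proceeding, the two properties just derived, that the components $f_j = rx_j$ are uncorrelated with variances $\lambda_j$, and that extracting one component removes exactly its eigenpair from the covariance matrix, can be checked numerically (a sketch on arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
T, m, j = 3000, 4, 0                     # j = 0: extract the largest component
R = rng.normal(size=(T, m)) @ rng.normal(size=(m, m))   # correlated "returns"
Rc = R - R.mean(axis=0)

V = np.cov(R, rowvar=False)
lam, X = np.linalg.eigh(V)               # eigh: ascending eigenvalues, orthonormal X
lam, X = lam[::-1], X[:, ::-1]           # sort descending, as in the text

# The components f_j = r x_j are uncorrelated with variances lambda_j:
F = Rc @ X
print(np.allclose(np.cov(F, rowvar=False), np.diag(lam)))   # True

# Residual covariance after extracting component j: V - lambda_j x_j x_j'
xj = X[:, j]
resid = Rc - np.outer(F[:, j], xj)       # r - f_j beta_j, with beta_j = x_j'
Vres = np.cov(resid, rowvar=False)
print(np.allclose(Vres, V - lam[j] * np.outer(xj, xj)))     # True
```

Both checks are in-sample identities, so they hold up to machine precision, not just approximately.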
Grouping in $F_q$ and $X_q$ the first $q$ factors and columns of $X$, and in $F_{m-q}$ and $X_{m-q}$ the remaining factors and columns of $X$, we have:

$$r = \sum_{i=1}^m f_ix_i' = \sum_{i=1}^q f_ix_i' + \sum_{i=q+1}^m f_ix_i' = F_qX_q' + F_{m-q}X_{m-q}'$$

which we are tempted to write as:

$$r = F_qX_q' + e$$

Now recall the initial factor model (we drop the $t$ suffix for the moment):

$$r = fB + \epsilon$$

It is tempting to equate $F_q$ with $f$ and $X_q'$ with $B$ for some $q$; at the same time it is tempting to equate $e$ with $\epsilon$ (footnote 65). Now, given the above construction, it is always possible to build such a representation of $r$. The question is whether, given a pre specified model $r = fB + \epsilon$, the above described method shall identify $f$, $B$ and $\epsilon$. The answer is: in general, not. In fact the two formulas are only apparently similar and become identical only under some hypotheses. These are:

Footnote (in matrix notation): if we suppose $V(r)$ invertible with eigenvector matrix $X$, we have $rX = F$ and immediately $r = FX'$, so principal components are linear combinations of returns and vice versa. Moreover $V(F) = V(rX) = X'V(r)X = X'X\Lambda X'X = \Lambda$, that is: principal components are uncorrelated and each has as variance the corresponding eigenvalue. Then, if we split $X$ vertically into two sub matrices $X = [X_1 : X_2]$, we have $rX = F = [rX_1 : rX_2] = [F_1 : F_2]$ and $r = FX' = F_1X_1' + F_2X_2'$, where $V(F_2X_2') = X_2\Lambda_2X_2'$. Since principal components are uncorrelated, whatever the number of components in $F_1$, their regression coefficients shall always be the same and correspond to the transposes of their eigenvectors (the first statement is a direct consequence of non correlation and the second was demonstrated in the text). In matrix terms: the "linear model" estimated with OLS, $r = F_1\hat B_1 + \hat U_1$, holds with $\hat B_1 = X_1'$ and $\hat U_1 = F_2X_2'$.

Footnote 64: Orthogonality here means that the factors are uncorrelated.

Footnote 65: It could be argued here that the expectation of $e$ is not zero.
Recall, on the other hand, that expected returns are typically nearer to zero than most observed returns, due to the high volatility. This is particularly true when daily data are considered. Moreover, the non zero mean effect is damped down by the "small" matrix $X_{m-q}$. Hence the expected value of $e$ can be considered negligible. (End of footnote 65.)

1. The dimension of $f$ is $q$.
2. $V(f)$ is diagonal.
3. $BB' = I$.
4. The rank of $V(\epsilon)$ is $m - q$ and the maximum eigenvalue of $V(\epsilon)$ is smaller than the minimum element on the diagonal of $V(f)$.

To these hypotheses we must add the already mentioned requirement that $f$ and $\epsilon$ be orthogonal. For any given $fB$, the second and third hypotheses can always be satisfied if $V(fB)$ is of full rank. In fact, in this case, it is always possible, using the procedure described above, to write $fB = \tilde f\tilde B$ where the required conditions hold for $\tilde f\tilde B$ (remember that, if the $f$ are unobservable, there is a degree of arbitrariness in the representation). Hypothesis one is more problematic: all we observe is $r$ and we do not know, a priori, the value of $q$. But the most relevant (and interesting) hypothesis is that the rank of $V(\epsilon)$ is $m - q$ and its eigenvalues are all smaller than those of $V(fB)$. This may well not be the case, and in fact we could consider examples where $\epsilon$ is a vector orthogonal to the elements of $f$ but $V(\epsilon)$ is of full rank and/or its eigenvalues are not all smaller than those of $V(fB)$. For instance: in classical asset pricing models (CAPM, APT and the like) the main difference between residuals and factors is not that the variance contributed by the factors to the returns is bigger than the variance contributed by the "residuals", but that factors are common to different securities, so that they generate correlation between returns, while residuals are idiosyncratic, that is: they should be uncorrelated across securities.
While principal component analysis guarantees zero correlation across different factors, the residuals in the principal component method are by no means constrained to be uncorrelated across different securities. In fact, since the varcov matrix of the residuals is not of full rank, some correlation between residuals must exist, and it shall in general be higher if many factors are used in the analysis (footnote 66). While this is not the place for a detailed analysis of this important point, it is useful to introduce it as a way of remembering that $r = F_qX_q' + E$ is, before all, a representation of $r$, and only under (typically non testable) hypotheses an estimate of a factor model.

Footnote 66: Let the row vector $z$ of $k$ random variables have a varcov matrix $A$ with $Rank(A) = h$. Then at most $h$ linear combinations of the elements of $z$ can be uncorrelated. The proof is easy. Suppose a generic number $g$ of uncorrelated linear combinations of $z$ exists, let these $g$ linear combinations be $u = zG$, and suppose, without loss of generality, that by a proper choice of the weights $G$ the variance of each $u$ is 1. Since the $u$ are uncorrelated we have $V(u) = G'AG = I_g$. The rank of $I_g$ is $g$; the rank of $G$, which is a $k \times g$ matrix, is at most $g$; and the rank of a product is less than or equal to the minimum rank of the involved matrices. Hence the rank of $A$ must by necessity be bigger than or equal to $g$; but, by hypothesis, we know it to be equal to $h$, so $g$ cannot be bigger than $h$ (we could go on and show that it is in fact equal to $h$, but we only wanted to show that AT MOST $h$ linear combinations could be uncorrelated).

In our setting we need the representation in order to simplify the estimation of $V(r)$; while the interpretation of the result as the estimation of a factor model is very useful when possible, the simple representation shall be enough for our purposes. It should always be remembered that our purpose is not the precise estimation of each element of $V(r)$.
What we really hope for is a sensible estimate of the variance of reasonably diversified portfolios made with the returns in $r$. In this case, even if the estimate of $V(r)$ is rough, it may well be that the estimate of a well diversified portfolio's variance is fairly precise since, by itself, diversification shall erase most of the idiosyncratic components in the variance covariance matrix. This intuitive reasoning can be made precise, but this is beyond the purpose of our introductory course.

A last point of warning is required. If we use enough principal components, then $F_qX_q'$ behaves almost as $r$ (the $R^2$ of the regression is big). The "almost" clause is important. Suppose you invest in a portfolio with weights $x_{q+1}/(x_{q+1}'1_m)$, that is, a portfolio with correlation 1 with the first excluded component (the denominator of the weights is there in order to have the portfolio weights sum to 1). By construction the variance of this portfolio is $\lambda_{q+1}/(x_{q+1}'1_m)^2$. However, the covariance of this portfolio with the included components is zero. In other words: if we measure the risk of any portfolio by computing its covariance with the set of $q$ principal components included in the approximation of $V(r)$, we shall assign zero risk to a portfolio correlated with one (or many) of the excluded components. The practical implications of this are quite relevant, but a thorough discussion is outside the purpose of these handouts. However: beware!

The question now is: we introduced the factors/components $F$ in an arbitrary way, deriving them from the spectral theorem. Are there other justifications for them?

11.3 Maximum variance factors

In the preceding section we derived a principal component representation of a return vector by comparing the spectral theorem with the general assumptions of a linear factor model.
Here we follow a different path: we characterize each principal component (suitably renormalized) as a particular "maximum risk" portfolio, with the constraints that each component must be orthogonal to every other component and that the sum of the squared weights be equal to one.

Linear combinations of returns are (up to a multiplicative constant) returns of (constant relative weights) portfolios (footnote 67). Given a set of returns, it is interesting to answer the question: which are the weights of the maximum variance linear combination of returns? (We repeat: this is not the same as the maximum variance portfolio.) This problem is not well defined, as the variance of any linear combination (provided it is not 0) can be set to any value by multiplying its weights by a constant. It could be suggested to constrain the sum of the weights to one; however, this does not solve the problem: again, by considering multiples of the different positions, the requirement can be satisfied and the variance set to any number, at least if weights are allowed to be both positive and negative. A possible solution is to set the sum of the absolute values of the weights to one. This would both solve the problem and have a financial meaning; alas, this can be done, but only numerically. Suppose instead we set the sum of squared weights to 1. This solves the bounding problem, with the inconvenience that the resulting linear combination shall in general not be a portfolio. But this choice yields an analytic solution.

Let us set the mathematical problem:

$$\max_{\theta:\theta'\theta=1} V(r\theta) = \max_{\theta:\theta'\theta=1} \theta'V(r)\theta$$

The Lagrangian for this problem is:

$$L = \theta'V(r)\theta - \lambda[\theta'\theta - 1]$$

so that the first order conditions are:

$$V(r)\theta - \lambda\theta = 0 \quad \text{and} \quad \theta'\theta = 1$$

Rearranging and using the spectral theorem we have:

$$[X\Lambda X' - \lambda I]\theta = 0$$

We see that, if we set $\theta = x_j$ and $\lambda$ to the corresponding $\lambda_j$, for any $j$ we have a solution of the first order conditions. Since $V(rx_j) = \lambda_j$, the solution to the maximum variance problem is given by the pair $x_1$ and $\lambda_1$ where, as usual, we suppose the eigenvalues sorted by size. From what was discussed in the previous section, the other solutions can be seen as maximum variance linear combinations of returns, where the maximum is taken with the added constraint of being orthogonal to the previously computed linear combinations. We see that the components, defined in a somewhat arbitrary way in the previous section, now become orthogonal (conditional) maximum variance linear combinations.

Footnote 67: As hinted at in several places of these handouts, given a linear combination of returns, there exist at least two ways of converting it into the return of a portfolio. If we only want the required portfolio to be perfectly correlated with the given linear combination, all that is needed is to renormalize the weights by dividing them by their sum (provided this is not zero). If we wish for a portfolio with the same weights (on risky assets) and the same variance as those of the linear combination, we must simply add to the linear combination the return of a risk free security, with weight equal to the difference between one and the sum of the linear combination's weights. Notice that in this second case, while the (one time period) variance of the linear combination shall be the same as the variance of the portfolio return (the risk free security has no variance over a single time period), the expected value shall be different. In fact, if the weight of the risk free security is greater than zero, the expected value of the portfolio return shall be (with a positive return assumed for the risk free security) greater than the expected value of the linear combination of returns; the opposite in case of a negative weight.
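The characterization can be checked numerically: with the sum of squared weights constrained to one, no randomly drawn $\theta$ attains a variance above $\lambda_1$ (the covariance matrix below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 8
A = rng.normal(size=(m, m))
V = A @ A.T                                   # a valid (positive definite) covariance matrix

lam, X = np.linalg.eigh(V)
theta_star, var_star = X[:, -1], lam[-1]      # top eigenvector and eigenvalue

# Draw many unit vectors theta (theta' theta = 1) and compute theta' V theta:
thetas = rng.normal(size=(100000, m))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', thetas, V, thetas)

print(variances.max() <= var_star + 1e-9)                 # True: nothing beats x_1
print(np.isclose(theta_star @ V @ theta_star, var_star))  # True: x_1 attains lambda_1
```

The inequality holds exactly in theory (the Rayleigh quotient of a symmetric matrix is bounded by its largest eigenvalue); the small tolerance only guards against floating point noise.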
11.4 Bad covariance and good components?

Suppose now that $V(r)$ is not known. In particular, our problem is to estimate such a matrix when $m$, the number of stocks, is big (say 500-2000). What we wrote up to this point suggests a way of simplifying a given variance covariance matrix using principal components. What happens when the variance covariance matrix is not given and we must estimate it?

Obviously we could start with some standard estimate of $V(r)$. For instance, suppose we stack in the $n \times m$ matrix $F$ our data on returns and estimate

$$\hat V(r) = F'F/n - F'1_n1_n'F/n^2$$

where $1_n$ is a column vector of $n$ ones. Then we could proceed by extracting the principal components from $\hat V(r)$.

It could be a puzzle for the reader that, in order to estimate the factor model, whose purpose is to make possible a sensible estimate of the covariance matrix, we need some a priori estimate of the same matrix. A complete answer to this question is outside the scope of these notes (this sentence appears an annoying number of times, doesn't it?); however, the intuition underlying a possible explanation is connected with the fact that, in principle, the principal components could be computed without an explicit a priori estimate of $V(r)$. Given a sample $R$ of $n$ observations on $r_t$, all that is needed, for instance in the case of the first component, is to find the vector $x_1$ of weights such that the numerical variance of $f_1 = Rx_1$ is maximum (with the usual constraint $x_1'x_1 = 1$). This can be done iteratively for all components. The idea is that, even if the full $V(r)$ is difficult to estimate, it may be possible to estimate the highest variance components well, while the estimation problems are concentrated on the lowest variance components.

More formally: we estimate $V(r)$ with some $\hat V(r) = V(r) + \Delta$, where $\Delta$ is a positive definite error matrix. Write the spectral decompositions of both matrices as $V(r) = \sum_j x_jx_j'\lambda_j$ and $\Delta = \sum_j e_je_j'\eta_j$.
Our hope is that the largest of the error eigenvalues $\eta_j$ is smaller than at least some of the eigenvalues of $V(r)$. In this case the estimation error shall affect the overall quality of the estimate $\hat V(r)$, but only with respect to the lowest eigenvalue components.

In summary: the principal components are defined as $f_t = r_tX$, where $X$ contains the eigenvectors of the return covariance matrix. The principal components are uncorrelated return portfolios (recall that a constant coefficient linear combination of returns is the return of a constant relative proportion strategy; moreover, recall that the sum of the weights in the principal component portfolios is not one). The variances of the principal components are the eigenvalues corresponding to the eigenvectors which constitute the portfolio weights. We can derive a solution to the problem $r_t = f_tB$ by simply setting $B = X'$. The percentage of variance of the $j$-th return due to each principal component can be computed by taking the squares of the elements of the $j$-th column of $X'$, multiplying each by the corresponding eigenvalue, and dividing each element of the resulting vector by the sum of the vector itself.

A simple PC analysis on a set of 6 stock return series can be found in the file "principal components.xls". A more interesting dataset, containing total return indexes for 49 of the 50 components of the Eurostoxx50 index (weekly data), can be found in the file "eurostoxx50.xls". Principal components were computed using the add-in MATRIX.

Examples
Exercise 11 - Principal Components.xls
Exercise 11b - PC, Eurostoxx50.xls

12 Black and Litterman

The direct estimation, based on historical data, of the expectation vector for a set of stock returns seems to be doomed to failure in any conceivable real world circumstance.
As considered in a previous section, this is mainly due to the fact that the typical estimate has a standard error, in yearly terms, which is in the range of 25-35% divided by the square root of the number of yearly data available in the dataset (notice that the use of daily, monthly or yearly data makes no difference, at least for log returns). For an estimate to show a 95% confidence interval of reasonable size (say ±5%) we should then use data for a number of years in the range of 100-120 and, for many reasons, this is quite an unlikely sample size. Typical sample sizes for reliable and homogeneous data on single stocks are 5-10 years and these imply 95% confidence interval sizes in the range of ±20%.

For this reason any direct use of portfolio optimization methods of the Markowitz kind shall imply unreasonable allocations: too much sampling error in the estimation of the expected return and, due to this, an asset allocation which shall be idiosyncratic to the specific estimation sample. This is in fact what we observe in practice: from the direct use of historical data we derive allocation weights wildly varying with time, completely different across stocks and, often, unreasonably extreme. The ex post (i.e. derived from historical estimates) optimal portfolio seems to be highly specific to the time of estimation. Accordingly, the optimal allocation with the Markowitz model changes a lot from time to time, while the market allocation, that is, the relative total capitalization of traded assets, is neither extreme nor so wildly time varying.

A possible solution is: do not bother with optimization and use some standard allocation rule. This could be simply the market allocation, as approximated by some wide scope index, or some equally weighted portfolio or any other reasonable choice. Obviously, due to the same problem described above, the expected value of any allocation containing a sizable stock proportion shall be difficult to estimate.
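The arithmetic behind these sample size claims is easy to check. A minimal sketch, assuming (as in the text) a yearly return standard deviation in the 25-35% range and a Gaussian 95% confidence interval:

```python
import math

# Sample size (in years) needed for a +/- h half-width 95% CI on a yearly
# expected return, assuming a yearly standard deviation sigma.
def years_needed(sigma, h, z=1.96):
    # half-width h = z * sigma / sqrt(n)  =>  n = (z * sigma / h)^2
    return (z * sigma / h) ** 2

# With sigma in the 25-28% range and h = 5% we get roughly 100-120 years:
n_low = years_needed(0.25, 0.05)   # about 96 years
n_high = years_needed(0.28, 0.05)  # about 120 years

# Conversely, with a typical 5-10 year sample the half-width is large:
def half_width(sigma, n, z=1.96):
    return z * sigma / math.sqrt(n)

hw = half_width(0.30, 9)           # roughly +/- 20%
```

The exact numbers depend, of course, on the assumed sigma; the order of magnitude does not.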
Notice, moreover, that any problem in estimating the expected value has two faces: first, it is difficult to estimate the parameter; second, and most important, even if the parameter were known, it is very difficult to assess which average return you shall get out of your investment over a reasonable period of years. For this second problem we can do nothing. For the first problem some useful suggestions were made during the nineties. The most widely implemented suggestion is the so called Black and Litterman model.

The basic idea is that of using as a starting point the market allocation and expressing our views as departures from this allocation. The main contribution of the method is to discipline the asset manager's action. The asset manager is required to numerically specify the extent and the confidence of his/her views, and the effect of this specification can be immediately checked in terms of departures from a reasonable (the market's) allocation. This helps in avoiding what often can be seen in practice: wrong implementation of reasonable views.

Suppose that the market allocation is observable, at least for aggregate asset classes, and suppose that this allocation is a mean variance optimal allocation. This is obviously a rather strong hypothesis; however, the portfolio of the market (not the CAPM market portfolio) should at least be reasonable, that is, rationally held, and a minimal requirement for this should be not too much mean variance inefficiency. Then, if we make an hypothesis on the variance covariance matrix of asset returns (typically a data based estimate) and on the market risk aversion parameter, we can invert the vector of market weights in order to find the implied vector of market expectations.
In formulas, since the tangency portfolio weights, renormalized in order to have a total sum of weights equal to 1, are:

w_mkt = λΣ⁻¹(µ_mkt − 1r_f)

we have

µ_mkt = Σw_mkt/λ + 1r_f

This procedure is equivalent to implied volatility computation using quoted option prices and the Black and Scholes model. In both cases the result would be irrelevant were it not used for some further modeling: in the case of implied volatility, the market prices of options being known, the estimate is useful for computing other derivative prices or for evaluating hedges; in the case of Black and Litterman the market portfolio is required in order to compute the market expectations vector so that, in absence of constraints or other information, the investor should reasonably hold the market portfolio itself. In this case the knowledge of the market implied expectations would be irrelevant. On the contrary, the idea of Black and Litterman is to combine the market implied expectations with further information (and, possibly, constraints) in order to derive a strategic allocation that can differ from that of the market.[68]

In Black and Litterman the private and market information of the asset manager are expressed as a distribution for the vector of expected returns µ_R, which is supposed to be a random vector. The private information is expressed by a set of views. In other words, the asset manager assesses q expected values and variances of a set of q linear combinations (portfolio returns) of µ_R. On the other hand, the market information is expressed by assessing that the expected returns of the assets in the market portfolio are the random vector µ_R with expected value equal to the market implied expected return and varcov matrix equal to a fraction of the observed returns varcov matrix.
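Going back to the market implied expectations, the inversion just described can be sketched numerically as follows (all inputs are illustrative assumptions, not calibrated values):

```python
import numpy as np

# Illustrative, assumed inputs for two aggregate asset classes.
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])      # return varcov matrix (an estimate)
lam = 0.5                             # lambda = V(r_m)/(E(r_m) - r_f), a guess
rf = 0.02                             # riskless rate
w_mkt = np.array([0.6, 0.4])          # observed market weights (sum to 1)

# Invert w_mkt = lam * Sigma^{-1} (mu_mkt - 1 r_f):
mu_mkt = Sigma @ w_mkt / lam + rf

# Round trip: re-deriving the weights from mu_mkt recovers w_mkt.
w_check = lam * np.linalg.solve(Sigma, mu_mkt - rf)
```

The round trip check makes the point that the implied expectations are, by construction, exactly those for which the market weights are optimal.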
In formulas: let R be the k × 1 random vector of market returns with expected values µ_R, Σ its varcov matrix, and P the q × k matrix of the weights of the q portfolios on which the asset manager expresses views. The market information is summarized by the hypothesis:

µ_R ∼ N_k(µ_mkt, τΣ)

where τ is a scalar smaller than one. The asset manager information is expressed by the hypothesis:

Pµ_R ∼ N_q(V, Γ)

where V is a vector of expected values for the view portfolios and Γ a diagonal varcov matrix expressing the confidence in these views.

We must now derive an expectation vector, say µ_BL, combining both market and private information, to be used for portfolio optimization. There are different possible ways of deriving the Black and Litterman formula for the µ_BL vector. A simple way is to solve the following optimization problem:

µ_BL = argmin_µ (µ − µ_mkt)'(τΣ)⁻¹(µ − µ_mkt) + (Pµ − V)'Γ⁻¹(Pµ − V)

The interpretation of this problem is simple: we want to find the value of µ which minimizes the distance (weighted with the inverse of the covariance matrices) with respect to the market implied value but also respects as well as possible the views of the fund manager. In more technical terms, we reduce our problem to a weighted least squares problem. In the limit, when the diagonal elements of Γ go to zero, that is, when there is infinite confidence in the views, the problem becomes equivalent to a constrained least squares problem.

[68] The reader shall notice some lack of coherence in this attitude. The use of the market portfolio and the hypothesis that this portfolio is derived by mean variance optimization implies some degree of uniform market information. On the other hand, we suppose also that there exists some private information which the asset manager can pool with the market implied expectation. A possible justification of this, albeit quite informal, is in the fact that the asset manager's portfolio size could be negligible with respect to the market.
On the other hand, when Γ has diagonal elements diverging to infinity (no confidence in the views), the solution to the problem is simply µ = µ_mkt.

In order to better understand the way in which the two terms in the objective function are weighed, it is useful to rewrite the objective function as:

µ_BL = argmin_µ [a/(a+b)] (µ − µ_mkt)'A⁻¹(µ − µ_mkt) + [b/(a+b)] (Pµ − V)'B⁻¹(Pµ − V)

where A and B are “numeraire” varcov matrices, that is: the τΣ and Γ varcov matrices whose terms have been divided by the sum of the terms on the diagonal. In other terms, these matrices express variances in relative terms. Here a and b are, respectively, the diagonal sums of the terms in (τΣ)⁻¹ and Γ⁻¹. Using these definitions it is easy to see that the weight of each of the two terms in the quadratic form is given by the relative value of a, resp. b.

In order to find µ_BL we begin by taking the first derivative of the objective function w.r.t. µ:

∂L/∂µ = 2(τΣ)⁻¹(µ − µ_mkt) + 2P'Γ⁻¹(Pµ − V)

We set this to 0:

(τΣ)⁻¹(µ_BL − µ_mkt) + P'Γ⁻¹(Pµ_BL − V) = 0

((τΣ)⁻¹ + P'Γ⁻¹P)µ_BL − ((τΣ)⁻¹µ_mkt + P'Γ⁻¹V) = 0

µ_BL = ((τΣ)⁻¹ + P'Γ⁻¹P)⁻¹((τΣ)⁻¹µ_mkt + P'Γ⁻¹V)

This last expression is the celebrated Black and Litterman formula and gives the value of the vector of expected returns which optimally (in the distance sense considered above) pools the market and the fund manager opinions. The Black and Litterman formula can be written in a slightly different and interesting way by exploiting the “matrix inversion lemma” or by direct tedious computation.
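Before moving to that alternative form, the formula just derived can be checked numerically. A minimal sketch, where all inputs (dimensions, the view, the value of τ) are illustrative assumptions:

```python
import numpy as np

# Illustrative, assumed inputs: two assets, one relative view.
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])       # varcov matrix of returns
tau = 1.0 / 3.0
mu_mkt = np.array([0.06, 0.08])        # market implied expectations
P = np.array([[-1.0, 1.0]])            # view: long asset 2, short asset 1
V = np.array([0.04])                   # view expected value
Gamma = np.diag([0.02 ** 2])           # view variance (confidence)

def black_litterman(tau, Sigma, mu_mkt, P, V, Gamma):
    # mu_BL = ((tau Sigma)^-1 + P' Gamma^-1 P)^-1 ((tau Sigma)^-1 mu_mkt + P' Gamma^-1 V)
    tS_inv = np.linalg.inv(tau * Sigma)
    G_inv = np.linalg.inv(Gamma)
    return np.linalg.solve(tS_inv + P.T @ G_inv @ P,
                           tS_inv @ mu_mkt + P.T @ G_inv @ V)

mu_BL = black_litterman(tau, Sigma, mu_mkt, P, V, Gamma)

# Limiting case: with no confidence in the view (huge Gamma) we recover mu_mkt.
mu_vague = black_litterman(tau, Sigma, mu_mkt, P, V, np.diag([1e6]))
```

Here the view spread (4%) is bigger than the market implied spread (2%), so µ_BL widens the difference between the two expected returns, as discussed below in the section on the views.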
The result is:[69]

µ_BL = µ_mkt + (τΣ)P'(PτΣP' + Γ)⁻¹(V − Pµ_mkt) = µ_mkt + K(V − Pµ_mkt)

While the first formula expresses µ_BL as a weighted average of the market expected values vector and of the fund manager opinions, this second formula implies that the tangency portfolio which takes into account both market and private information is given by the market portfolio plus a “spread position”. In fact, using the usual formula for the optimal portfolio, we have:

w_BL = gλ(Σ⁻¹(µ_mkt − 1r_f) + Σ⁻¹K(V − Pµ_mkt))

where

g = 1/[λ(1'Σ⁻¹(µ_mkt − 1r_f) + 1'Σ⁻¹K(V − Pµ_mkt))]

is a constant which makes the sum of the weights of the portfolio equal to 1.

This is the algebra of the Black and Litterman model. Much more relevant than this is the choice of the inputs, which strongly influence the final allocation.

12.0.1 The market portfolio

In principle the market portfolio should consider the total value of each traded asset, that is, the financial wealth of all agents. To leave out assets, or to aggregate assets in portfolios, implies an incorrect evaluation of µ_mkt. In practice it is obviously impossible to use all marketable assets: just as an instance, this would require the estimation of a huge variance covariance matrix with thousands of rows, an impossible task. The current practice is that of either choosing a detailed analysis (asset by asset) of a single market or market subset, or of using the Black and Litterman model for strategic allocations among aggregated asset classes. Much could be said pro and con each choice.

[69] The tedious computations are as follows:

µ_BL = ((τΣ)⁻¹ + P'Γ⁻¹P)⁻¹((τΣ)⁻¹µ_mkt + P'Γ⁻¹V) =

= ((τΣ)⁻¹ + P'Γ⁻¹P)⁻¹(τΣ)⁻¹τΣ((τΣ)⁻¹µ_mkt + P'Γ⁻¹V) =

(remember now that A⁻¹B⁻¹ = (BA)⁻¹)

= (I + τΣP'Γ⁻¹P)⁻¹(µ_mkt + τΣP'Γ⁻¹V) =
([69] continued)

= (I + τΣP'Γ⁻¹P)⁻¹(µ_mkt + τΣP'Γ⁻¹V + τΣP'Γ⁻¹Pµ_mkt − τΣP'Γ⁻¹Pµ_mkt) =

= (I + τΣP'Γ⁻¹P)⁻¹((I + τΣP'Γ⁻¹P)µ_mkt + τΣP'Γ⁻¹(V − Pµ_mkt)) =

= µ_mkt + (I + τΣP'Γ⁻¹P)⁻¹τΣP'Γ⁻¹(V − Pµ_mkt) =

= µ_mkt + (I + τΣP'Γ⁻¹P)⁻¹τΣP'Γ⁻¹(Γ + PτΣP')(Γ + PτΣP')⁻¹(V − Pµ_mkt) =

= µ_mkt + (I + τΣP'Γ⁻¹P)⁻¹(τΣP' + τΣP'Γ⁻¹PτΣP')(Γ + PτΣP')⁻¹(V − Pµ_mkt) =

= µ_mkt + (I + τΣP'Γ⁻¹P)⁻¹(I + τΣP'Γ⁻¹P)τΣP'(Γ + PτΣP')⁻¹(V − Pµ_mkt) =

= µ_mkt + τΣP'(Γ + PτΣP')⁻¹(V − Pµ_mkt)

The only point we want to stress here is that the market expected return for each asset derived by the Black and Litterman model strongly depends on the choice of the market portfolio proxy. In other words: the market expected return for the same asset or asset class shall differ if this asset or asset class is introduced in different proxies of the market portfolio. This could imply some degree of incoherence if the model is used for helping in the allocation of many, partially overlapping, portfolios.

Notice the difference with the Markowitz result: in the Markowitz model we find the “best” mean variance portfolio for a given set of assets. It is not surprising, and completely coherent, that the weight of a given asset changes if the other assets in the portfolio to be optimized change. However, the market portfolio, under the CAPM model, is a mean variance portfolio only if taken as a whole. Sub portfolios of the market portfolio are not, in general, mean variance portfolios. Hence, the use of sub portfolios of the market portfolio for deriving from these the market expectation vector, as if they were optimal mean variance portfolios, could lead to a biased evaluation (question: when is it the case that a sub portfolio of a mean variance portfolio is still mean variance optimal?).
A possible way out of this problem is a deeper use of the CAPM theory. We know that the CAPM basic equation is:

E(r_i) − r_f = λβ_i

where r_i is the return of the i-th asset, β_i = cov(r_m − r_f, r_i − r_f)/Var(r_m − r_f), r_m is the return of the market portfolio, and λ (not to be confused with, but related to, the parameter with the same name in the Black and Litterman model) is the market price of risk in units of the market portfolio excess expected return, that is λ = E(r_m − r_f).

This result suggests the following procedure: choose a large cap weighted index as the market portfolio, estimate the betas only for those assets involved in your asset allocation, choose a value for λ and compute the expected excess return for each asset involved in the asset allocation. Following this procedure it is possible to evaluate the market implied excess expected returns of the assets of interest in a coherent way, in the sense that the valuation only depends on the market index used as a proxy of the market portfolio but does not require computing the variances and covariances of all the assets in the index. We could then use a very exhaustive index and still keep our problem numerically manageable.

It must be said that this procedure, while useful, tends to give results which are more unstable than the standard Black and Litterman procedure, in particular when we need to evaluate the market expected excess return for assets with low correlation with the market portfolio. In this case a possible alternative is the use of multifactor models.

12.0.2 The estimation of Σ

The variance covariance matrix of returns, Σ, is, in general, estimated from the data. The typical estimate of Σ is the standard statistical estimate based on three or more years of monthly or at most weekly data. Sometimes a smoothed estimate analogous to the one we described for the variance is used.
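The beta based procedure described above can be sketched on simulated data; the index, the sample and the value of λ are all assumptions here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated weekly excess returns: a market index and two assets whose true
# betas are 0.8 and 1.3 (purely illustrative data).
n = 2000
mkt = rng.normal(0.001, 0.02, n)
assets = np.column_stack([0.8 * mkt + rng.normal(0, 0.01, n),
                          1.3 * mkt + rng.normal(0, 0.02, n)])

# beta_i = cov(r_m - r_f, r_i - r_f) / Var(r_m - r_f)
betas = np.array([np.cov(mkt, assets[:, i])[0, 1] / mkt.var(ddof=1)
                  for i in range(assets.shape[1])])

lam = 0.05   # an a priori guess for lambda = E(r_m - r_f), in yearly terms

# Market implied expected excess returns: E(r_i) - r_f = lambda * beta_i
implied = lam * betas
```

Only the betas of the assets of interest are needed, not the full covariance matrix of the index constituents, which is the point of the procedure.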
A characteristic problem in the estimation of the variance covariance matrix, which has a huge influence on portfolio optimization, is the overestimation of covariances. This may have dire consequences. Suppose you are considering in your portfolio just two assets. Their expected values are very similar and so are their variances. Suppose now that their correlation is not extreme: in this case, even if the expected values are not the same, the optimal investment shall share its weights on both assets. Suppose, instead, that the correlation is near one. In this case there is no advantage in diversification and the weight shall concentrate almost completely on the asset with the highest expected value.

Overestimated correlations, joined with bad estimates of expected values, are one of the sources of the observed instability in optimal mean variance allocations. Obviously, if the correlations are really high, there is no problem in allocating the full weight to the highest expected value asset. On the other hand, due to the similarity between expected returns, a shared allocation would not be badly suboptimal. Suppose, on the contrary, that the covariance was overestimated. In this case to concentrate the full weight on one of two assets, very similar in expectation terms, would be a bad mistake: we would lose the chance of decreasing the overall variance thanks to diversification.

Working on this intuition we could suggest a simple “shrunk” version of the standard estimate of the variance covariance matrix, which shrinks it toward an average covariance. Start with the standard estimate of the variance covariance matrix and derive from this the estimate of the correlation matrix and the estimate of the vector of variances. Let Ω be the estimated correlation matrix.
From this derive the shrunk estimate

αΩ + (1 − α)Θ

where Θ is a reference correlation matrix, typically a matrix with ones on the diagonal and the average of the off diagonal elements of Ω as correlation terms (more complex structures can be considered when we are analyzing data from more than one market). The typical α shall be a number in the range of 0.8-0.9. The resulting estimate shall be converted into a covariance matrix by composing it with the estimates of the variances and shall show less extreme covariances than the original estimate.

12.0.3 The risk aversion parameter λ

If the market portfolio does not contain the riskless asset and if its weights sum to 1 then, as seen in the Markowitz section, we have:

λ = V(r_m)/(E(r_m) − r_f)

In principle this parameter could be estimated using a very long time series for an exhaustive market index. In practice, for the usual reasons, only the numerator of λ can be estimated in this way; for the denominator we shall need some a priori guess.

12.0.4 The views

Where do the views come from? Wish I knew. All I can tell you is how to express your views, if any (otherwise the use of a market portfolio is never a bad choice). In Black and Litterman a view is a specification of the expected return and variance, over a given period of time, of a portfolio. There exist, broadly, two types of views: absolute and relative. An absolute view is a view on a portfolio whose net position is long or short; a relative view is a view on a portfolio whose sum of weights is zero.

Algebraically, the weights of the portfolios on which the fund manager expresses views are written in the rows of a matrix P: each row represents the weights of a different portfolio and each column the weights for the same asset in different portfolios. The expected value for each view is written in the column vector V = E(Pµ_R) and the variance of each view in the diagonal elements of the matrix Γ. The correlation between different views is supposed to be 0.
Notice that this does not imply that the correlation between the returns of different view portfolios is zero, but only that the correlation between the elements of Pµ_R is zero.

In order to specify the views, the fund manager could simply specify the extremes A, B of an interval in which each row of Pµ_R is believed to fall with probability of about 95%. With this information the expected value of the j-th view could be estimated as v_j = (A_j + B_j)/2 and the standard deviation as √γ_j = (v_j − A_j)/2.

Notice that from the formula:

µ_BL = µ_mkt + (τΣ)P'(PτΣP' + Γ)⁻¹(V − Pµ_mkt) = µ_mkt + K(V − Pµ_mkt)

we see that the effect of a view depends on the difference between the view expected value V and the market valuation of the same quantity: Pµ_mkt. For instance, suppose that a view only involves two assets and the view portfolio is short one asset and long the other with identical weights. Suppose that the view expected value is positive. This does not in general imply that, by the effect of this view, the resulting µ_BL shall display a difference between the expected values of the two assets bigger than that to be found in µ_mkt. This shall be true only if the difference between the expected values of the two assets in µ_mkt is smaller than that hypothesized in V. It is then very useful, during the process of view specification, to always compare V with Pµ_mkt.

12.0.5 The choice of τ

Here we have very little to say. As mentioned above, the choice of τ is relevant only relative to the choice of the elements in Γ. In fact, if we multiply both τ and Γ by the same scalar, the Black and Litterman formula does not change. The practical meaning of τ is that of transforming Σ, the estimate of the varcov matrix of the return vector R, into an estimate of the varcov matrix of µ_R, the expected value of the return vector. Typical values chosen in the available examples are of the order of 1/3, meaning that 1/3 of the variances of the elements of R is due to a random variation of µ_R.
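The interval rule for specifying views described in section 12.0.4 is easy to code. A minimal sketch, with assumed 95% intervals for two view portfolios:

```python
import numpy as np

# Assumed 95% intervals [A_j, B_j] for the view portfolio expected returns.
A = np.array([0.00, -0.02])            # lower extremes
B = np.array([0.08, 0.06])             # upper extremes

# v_j = (A_j + B_j)/2 and sqrt(gamma_j) = (v_j - A_j)/2:
V = (A + B) / 2                        # view expected values
sd = (V - A) / 2                       # the 95% half-interval is about 2 sd
Gamma = np.diag(sd ** 2)               # diagonal varcov matrix of the views
```

The resulting V and Γ are exactly the inputs required by the Black and Litterman formula above.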
12.0.6 Further readings

Lots of applied papers were written about the Black and Litterman model and its extensions. Excel and Matlab implementations also abound. A possible reference point is the webpage http://www.blacklitterman.org/ by Jay Walters.

Examples

Exercise 12 - Black and Litterman.xls

13 Appendix: Probabilities as prices for betting

In this appendix we give a very small hint of three “formidable” topics: the definition of probability in betting systems; the connection between probability and frequency; and the big question concerning whether actual people's decisions can be described with probability models. It is obviously impossible to deal with these topics in such a small space, but it is in any case useful to point them out for future study.

13.1 Betting systems

Here we recall the betting system definition of the probability P(A) of an event A as the price to pay for buying a ticket which results in a win of 1 if A happens and 0 if A does not happen. If we suppose a set of such bets is made on a class of events and any finite combination of events can be bet on or against (in this second case P(A) is the price received for selling the ticket), then the avoidance of arbitrage (the existence of combinations of bets which guarantee a win with no cost, what is known as a “dutch book”) implies that P must satisfy the properties traditionally assumed for a probability (assigned to a finite set of events).

For instance, suppose you are betting on the event A and the price you pay for this bet is P(A). This must be between 0 and 1. In fact, suppose P(A) > 1; then I accept your bet and, in case A is true, I give you 1 with a net gain of P(A) − 1 > 0 while, if A turns out to be false, I keep your P(A). In both cases I come off with a positive gain. This is an arbitrage or, in betting language, a “dutch book”. Suppose now you are betting on both A and Ā and the prices you are willing to pay for these bets, P(A) and P(Ā), do not sum to 1.
Suppose, for instance, they sum to less than 1. Then, since at these prices you are both receiving and making bets, I pay you those prices and bet on both events. Since either A or Ā is going to be true, I am getting in any case 1, and I paid less than 1 for this.

A set of such no arbitrage prices is called a set of “coherent” probabilities and we can prove in a very simple way that, if the events we bet on constitute an algebra, all the properties required of probabilities in an abstract definition of probability must be satisfied. There is a single exception: countable additivity cannot be justified in this setting.

This idea of no arbitrage in betting systems is an exact analogue of the modern theory of no arbitrage in financial markets which, however, is a little more general. First: in markets you have time, and time has a value which is usually positive (the interest rate). If the price for the bet is paid today, while the bet is settled in the future, the bettor shall require the price of the bet to be discounted with the value of time so that, for instance, in the case of the bets on A and Ā the “spot” prices of the bets shall sum to something less than 1. If you pay and settle the bet on the same future day, the prices for the bets shall require no time value (they are now forward prices) and they'll sum to 1. Second: with simple bets the two possible values for the payoff are 0 or 1. This may be the case for the payoff of some security (as, for instance, digital options); in the general case, however, future values of securities in financial markets may be, in principle, generic real numbers. This kind of problem has been taken into account by probabilists for at least two centuries and the extension of the betting definition to this setting is as follows: suppose you are considering bets i = 1, ..., m on m quantities whose future values shall be X_i. You pay the price (or receive the price) P(X_i) for making (receiving) the bet whose payoff is going to be X_i.
If you are willing to bet or receive bets of this kind and avoid arbitrage, it is possible to show that P(X_i) must satisfy the properties we usually require of the mathematical expectation (at least in the discrete or bounded X case). As is well known, the expected value of a 0, 1 valued random variable is the probability with which the random variable takes the value 1, and we can represent any event A with such a variable (called the indicator function of A). So, in this particular case, the two settings coincide.

13.2 Probability and frequency

Now, a point of interest. While all this is quite intuitive, obviously, it tells us nothing about the values we should choose for these P, provided no arbitrage is involved. Can we say something more about this? Here things become very fuzzy. For instance: we are used to thinking that probability has something to do with frequency. What is the connection of this definition with frequency? Attempts at connecting a definition of probability to limits of frequencies have been made but are difficult to justify; after all, limits are very metaphysical objects. This notwithstanding, it is a widespread notion that any definition of probability should yield a probability calculus whose algebra has to be valid when applied to frequencies of events in finite replications of experiments. This stems from the requirement that probability models should be, in appropriate situations where repeated experiments are possible, useful in describing possible future frequencies. The great mathematician F. P.
Ramsey, in a path breaking essay (“Truth and Probability”, 1926), states this point in an admirably clear way:

“I suggest that we introduce as a law of psychology that [the subject's] behaviour is governed by what is called the mathematical expectation; that is to say that, if p is a proposition about which he is doubtful, any goods or bads for whose realization p is in his view a necessary and sufficient condition enter into his calculations multiplied by the same fraction, which is called the “degree of his belief in p”. We thus define degree of belief in a way which presupposes the use of the mathematical expectation. We can put this in a different way. Suppose his degree of belief in p is m/n; then his action is such as he would choose it to be if he had to repeat it exactly n times, in m of which p was true, and in the others false.... This can also be taken as a definition of degree of belief, and can easily be seen to be equivalent to the previous definition.”

Here “the previous definition” is Ramsey's version of the no arbitrage betting system definition of (subjective or personal) probability. So, according to Ramsey, probability and relative frequencies should (and do) share the same mathematical rules because the personal probability of an event assessed by a decision maker can be interpreted as a forecast of the relative frequency with which the event shall be true in a given set of hypothetical future experiments, in the sense that the decision maker's action taken on the basis of such a probability “is such as he would choose it to be if he had to repeat it exactly n times, in m of which p was true, and in the others false”.

A great Italian mathematician, Bruno deFinetti, gave a proof of a result which, in a very particular case, connects probabilities and frequencies.
Suppose you have a sequence of 0, 1 valued random variables and suppose that, for each choice of n of these, the probability you give to this sequence depends only on the number of zeros and ones in the sequence (in technical words, you say that these random variables are exchangeable). Then, the bigger n is, the nearer your evaluation of the probability for a sequence containing any given number of 0s and 1s should be to the product of the probabilities for each element in the sequence. Moreover, the bigger n is, the nearer the probability you use for, say, a 1 in that given sequence should be to the relative frequency of 1s in that sequence. In other words, exchangeability means approximate independence, conditional on the relative frequency of 0s and 1s, and an approximate value of the probability of 0 or 1 equal to the frequency of 0 or 1 in the given sequence.

This result tells us that, in some very specific but relevant setting, a probability statement must converge to a relative frequency, so that, in such a setting, there exists a connection between values of probabilities and values of frequencies and not only between the rules of probability and the rules of frequency.

13.3 Probability and behaviour

A last, formidable topic. We use probability models for describing games and markets. In our probability models we require no arbitrage. Do gamblers/investors behave in a way which agrees with this definition of probability? At the time of Ramsey, it was already known that the true behaviour of decision makers in gambling situations did NOT satisfy the basic axioms of probability (and this is true when we study investor behaviour too). The standard algebra we use for computing probabilities should then be considered as a NORMATIVE description of decision making under uncertainty, not as a POSITIVE description of the behaviour of animal spirit motivated, real world, obfuscated market agents.
In the recent past a lot of research effort has been spent in trying to formalize real world, irrational (that is: arbitrage ridden and incompatible with relative frequencies) decision making behaviour using things like subadditive probabilities (P(A) + P(Ā) ≤ 1). This choice, obviously, makes probability not a good model for frequencies (which by definition add up to 1). However, the judgment on the possibility and usefulness of a clever mathematical description of irrational behaviour is better left to the Reader, together with the much more practically relevant question of the opportunity, or even possibility, of behaving rationally when others do not.

14 Appendix: Numbers and Maths in Economics

There is an age old debate on the use of Maths in Economics, dating back at least to the 17th century. The interested student could read, for instance, the introductory papers in the first number of Econometrica (1933) and see an interesting trace of how this debate was framed by the soon to be “winning” side. In particular it shall be useful to read first the position of Joseph Schumpeter and then the position of Ragnar Frisch. Schumpeter writes:

There is, however, one sense in which Economics is the most quantitative, not only of 'social' or 'moral' sciences, but of all sciences, Physics not excluded. For mass, velocity, current, and the like can undoubtedly be measured, but in order to do so we must always invent a distinct process of measurement. This must be done before we can deal with these phenomena numerically. Some of the most fundamental economic facts, on the contrary, already present themselves to our observation as quantities made numerical by life itself. They carry meaning only by virtue of their numerical character. There would be movement even if we were unable to turn it into measurable quantity, but there cannot be prices independent of the numerical expression of every one of them, and of definite numerical relations among all of them.
Here it seems that the simple fact that economic data are mostly recorded as numbers implies that there must exist some relevant “mathematical” structure behind them. This is opposed to the supposedly more artificial introduction of numbers in Physics through the invention of processes of measurement. Obviously things are not so simple and the opposite view is more likely to be the correct one: even if some phenomenon is recorded as a quantity, it could be the case that no simple mathematical structure can be applied to such quantities in order to reveal some relevant structure of the phenomenon.

In the field of Finance, for instance, the fact that we observe a time series of prices for a given asset does not by itself mean that, say, we can apply time series analysis to this data. For this to be possible it is neither necessary nor sufficient that the data be expressed as numbers. What is relevant is that, after some translation of the data to numbers (but this is by no means necessary), we can state that, for some cogent reason, maybe based on the description of the system which generates them, these numbers satisfy some probabilistic model of the class suggested by time series analysis.

The introduction of quantities, when relevant because they satisfy some algebra as, for instance, in Physics, is usually not the beginning but the end of a long conceptual evolution striving to assess in some useful way aspects of real world phenomena involving, say, the movement of bodies, which satisfied some simple rules which could be dealt with by mathematical models. In other words: quantities are defined through mathematical models, but quantities by themselves do not imply mathematical models.
The wonderful quality of abstraction and modeling effort which resulted in the concept of inertial mass, and in its central role in the successful mathematical modeling of the kinematic properties of completely different objects (which could have been, and actually were, "measured" in many other totally unproductive ways), could be an illuminating reading for any student. And here we quote the Editor's Note by Ragnar Frisch (surely more quantitatively minded than Schumpeter) in the same issue of the Journal. In it, the reasons and the problems of the use of Mathematics (and, most important for this Author, Statistics) are clearly delineated. Mathematics is seen as a tool for speaking with rigor and for deriving testable hypotheses to be analyzed with the use of Statistics, but: "Mathematics is certainly not a magic procedure which in itself can solve the riddles of modern economic life, as is believed by some enthusiasts. But, when combined with a thorough understanding of the economic significance of the phenomena, it is an extremely helpful tool."

The debate goes on and, as is frequently the case, its "going on", not its arguable solution, is the source of its usefulness.

15 Appendix: Optimal Portfolio Theory, who invented it?

A theory of optimal reinsurance which, as a particular case, includes optimal mean variance portfolio analysis was described in detail by Bruno deFinetti in a very famous (in Europe and among actuaries) prize winning paper "Il problema dei pieni" (1940), Giornale dell'Istituto Italiano degli Attuari. This was 12 years before Roy's and Markowitz's papers. On several occasions Italian academicians tried to point out deFinetti's priority with respect to portfolio theory, both before and after Harry Markowitz's Nobel award of 1990. This was to no avail up to the time when another big name of the American financial academy, Mark Rubinstein, was interested in the case by a small group of Italian researchers.
Thanks to Mark Rubinstein, Markowitz agreed to give his opinion on the topic and he did so in: Harry Markowitz, "deFinetti scoops Markowitz", Journal of Investment Management, Vol. 4, No. 3, Third Quarter 2006. Mark Rubinstein added a preface to the paper where he acknowledged a number of priorities to deFinetti, in the quoted and other papers, among which: early work on martingales (1939); mean-variance portfolio theory (1940); portfolio variance as a sum of covariances; the concept of mean-variance efficiency; normality of returns and implications of "fat tails"; bounds on negative correlation coefficients; an early version of the critical line algorithm; the notion of absolute risk aversion (1952); early work on optimal dividend policy (1957); early work on Samuelson's consumption loan model of interest rates (1956). In his paper Markowitz recognizes deFinetti's priority. However he devotes most of the paper to criticizing a marginal point in the algorithm deFinetti suggests for finding the optimal portfolio. A marginal point both because deFinetti's imprecision can easily be corrected in general, while the procedure is already perfectly correct for any "sensible" case, and because, while Markowitz suggested in 1956 an algorithm for solving the portfolio optimization quadratic programming problem, he was neither the inventor of quadratic programming nor was his contribution to portfolio optimization ever gauged on the basis of this algorithm. In fact the 1952 paper contains no algorithm at all but it still contains all the basic ideas of optimal portfolio theory. Scientists are people, with all the standard weaknesses. So, it is perhaps ungenerous to recall the scathing answer given by Markowitz to Peter Bernstein, worried, after reading Rubinstein's introduction and Markowitz's paper, about the status of one of his "heroes of the theory of Finance" (Capital Ideas Evolving, 2007, page
109): "When I asked Markowitz what he would have done if someone had shown him the deFinetti paper while he was working on his thesis, his response was unqualified: 'I would have seen at once that deFinetti was related to my portfolio selection work, but by no means identical to it. I guess I would have given him a footnote in my paper'". This seems to be even less than what Markowitz admitted in his 2006 paper. Indeed, if one reads the 1952 Markowitz paper with deFinetti's work in mind, the meaning of "by no means identical" becomes clear: Markowitz's paper is a very particular case of deFinetti's general approach. The question could then be: which should be the paper and which the footnote... Peter Bernstein's comment on this answer is quite cryptic: so naive that it could be understood as intentionally naive: "This answer was a great relief to me. As it should be to all who appreciate the value of Capital Ideas to the world of investing. Markowitz's work on portfolio selection was the foundation of all that followed in the theory of Finance, and of the Capital Asset Pricing Model in particular". Alas, I was unable to ask Bernstein about the true meaning of this sentence as he died in 2009 before I read his book.

16 Appendix: Gauss Markoff theorem

The Gauss-Markoff theorem is a nice case study in the history of scientific attribution. It is generally difficult and sometimes vain to assess scientific priorities. New ideas are quite infrequently "new": they typically ripen out of a rich story of attempts and crystallize into new views only in the academically regulated opinion of posterity. However, to trace back some concept to its fuzzy origin is always an interesting and often an illuminating endeavor.
Students of Finance should perhaps look for the origin of option pricing and Arrow state price density in Vinzenz Bronzin's work of 1908 (see "Vinzenz Bronzin's option pricing models, exposition and appraisal", Springer 2009), put call parity in De La Vega (1688), some three hundred years before the "official" discovery by Stoll (1969, Journal of Finance), optimal portfolio theory (and much else), as already mentioned, in deFinetti (1940), and so on. In the same vein the least squares method, having to do with Pythagoras' theorem, is quite old but, in its modern form applied to the interpolation of noisy data, it is sometimes credited to the young Gauss (1795). However there exists a manuscript dated from the 1770s by the great Italian mathematician Lagrange (born Lagrangia, the name changed into Lagrange in France where he spent most of his active life). On the other hand Lagrange's memoir was probably inspired by a work of Simpson (1757). These and related works have to do with the "theory of errors", that is, they considered the best way of putting together observations subject to what we could today call "random error" in order to better "estimate" (again a modern word) one or more unknown quantities. This is one of the origins of the concept of (in general weighted) arithmetic average. The other, more ancient origin, from which comes the name "average" (which has to do with "avaria", an Italian term coming from Arabic), is quite peculiar: it comes from a rule for distributing, across all merchants whose goods are the freight of a ship, the losses due to jettisoning in order to save the ship. As we know, the arithmetic mean of n numbers minimizes the sum of squares of the differences between the given numbers and the mean itself. So, it is a least squares estimate of a constant. A precise description of the method was given by Legendre (1805) who was probably the first to use its modern name "least squares" (moindres carrés).
Moreover Legendre explicitly considered the application of the method to linear approximations. Gauss quoted Legendre in his astronomical work "Theoria motus" of 1809 and mentioned the fact that the method of Legendre had been used by himself since 1795. Legendre was quite upset by this and wrote to Gauss to complain about this appropriation. Laplace published between 1812 and 1820 his "Théorie analytique des probabilités", where least squares are given an important (if mathematically quite cumbersome) place. The renown of this book in the nineteenth century was very important in establishing the method and tended to overshadow Gauss's contribution. This may seem strange as Laplace himself wrote in his Théorie a short historical note on the theory of errors very similar to the one summarized here. If there may be some controversy about the invention of least squares, there is no controversy about the so called "Gauss-Markoff" theorem. Neither Legendre nor Laplace nor Lagrange nor Simpson ever stated or gave a proof of this theorem. The Gauss-Markoff theorem has to do not with least squares as an algorithm but with conditions for the optimality of least squares. It deals, in other words, not with least squares as a tool for fitting "models" to given "data" but with the properties of least squares from a probabilistic point of view, that is, considering all possible potential samples. This is a paramount instance of a new attitude which was developing between the end of the 18th and the beginning of the 19th century, where Probability, originally developed from gambling problems, begins to be seen as a tool for justifying what today we would call "statistical inference". The nearest thing to Gauss's theorem in Laplace's work is a result about what we could today call "consistency" of the least squares estimate: again a probabilistic justification for a statistical method.
As far as I know the proof of the "Gauss-Markoff" theorem was first given by Gauss in his celebrated "Theoria combinationis observationum erroribus minimis obnoxiae" (1821), which we could translate as "Theory of the least affected by errors combination of observations" or "Theory of those combinations of observations which are least affected by errors". A work where, by the way, Gauss (re)introduces the Gaussian density (but does not require the errors to be Gaussian) and also gives a table for some of its quantiles (a curiosity: Gauss's table does not contain the .975 quantile, the famous 1.96). In this work it could be difficult to recognize the theorem at first reading, for at least two reasons. First: while the result is there, even in greater generality than what we are used to, the kind of mathematical language is quite different from the current one. Second: the result is not directly connected with the linear model (which is a special case) but with the general problem of "estimating" (our word) one or more unknown constants observed with error (the theory of errors mentioned above). It is to be noticed that the case of unequal variances across errors is considered too (in fact this GLS approach is used throughout the work). Gauss had already given a related result some years before, in 1809, in the astronomical work "Theoria motus", but with a different argument which today we could describe as a mix of bayesianism and maximum likelihood. Why the name of Markoff? The great Russian mathematician gives a proof of the theorem in Ch. 7 of his book "Исчисление вероятностей" (Calculus of probabilities). This became available in an abridged German translation in 1912 (with a preface by the Author himself). At a direct reading, it is clear that Markoff's chapter on least squares is a slightly more modern and detailed version of Gauss's work (including the use of the Gaussian density).
In fact, the first bibliographic entry at the end of the chapter, German version, is a German translation from 1887 of Gauss's original Latin. This is not an addition but a slight alteration by the translators: the original Russian version quotes the same work by Gauss as translated into French by J. Bertrand in 1855. Markoff's work was quoted by J. Neyman (a Polish-American mathematical statistician, one of the creators of hypothesis testing and, more in general, of modern mathematical Statistics) in a discussion paper he presented at the Royal Statistical Society on June 19th, 1934: "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection", Journal of the Royal Statistical Society, Vol. 97, No. 4 (1934), pp. 558-625. He names "Markoff method" (Note II, pag. 563, 593) a set of procedures inspired by the above quoted chapter in Markoff's book. Neyman, pag. 563, gives references for both the Russian and the German version of Markoff's book. Neyman, however, seems to give Markov the merit of the result, or at least of: "the clear statement of the problem. The subsequent theory is matter of easy algebra". Neyman doubts that the method was known, pag. 564, due to the fact that it was published in Russian. Neyman seems to ignore Gauss's work on the topic and does not quote Gauss. The right attribution to Gauss was immediately reinstated by R. A. Fisher in the discussion following the paper, pag. 616. The short historical note by Plackett (Biometrika, Vol. 36, No. 3/4 (Dec., 1949), pp. 458-460) could be an interesting reading as it sets things straight. In this note, moreover, we have one of the first instances where the result is phrased using matrix notation in a way similar to the one we use today. A more thorough historical paper with more details about our story is: "Studies in the History of Probability and Statistics. XV: The Historical Development of the Gauss Linear Model" by Hilary L.
Seal (Biometrika, Vol. 54, No. 1/2 (Jun., 1967), pp. 1-24). A last note about matrix notation. Modern matrix notation is relatively new in Statistics. One of the first and, arguably, the first famous instance of a modern matrix based presentation of least squares, and GLS (but no explicit version of the "Gauss-Markoff" theorem), is in a paper by Alexander Aitken: "On least squares and linear combination of observations", Proc. Royal Soc. Edinburgh, 55, (1935), 42-48. The New Zealand mathematician, beyond many important results, also played an important role as an influential pioneer of the use of matrix calculus and matrix notation in Statistics. Summary: the theorem, in its general form, is by Gauss; Markoff never claimed to be the author of the result and explicitly quotes Gauss; Markoff's name was added to that of Gauss probably because of a wrong attribution by Neyman; this became common use notwithstanding the immediate correction by no less than R. A. Fisher.

17 Exercises and past exams

All past exams are available with solutions on Francesco Corielli's webpage. The following table is a cross reference of each exercise in each past exam classified by argument. Since some exercises require topics from different chapters, the classification is only approximate. If you download the exam pdf files in the same directory as the handouts, a click on the date should open the proper file.
[Cross-reference table: one row per exam date, from 2005-12-23 through 2019-01-21, one column per subject. The subjects, with the corresponding chapter numbers, are: Risk premia, returns and time diversification (1, 2, 2.1); VaR (2); Variance estimation (3); Linear Model (9); Style Analysis (10); Factor Models (11); Principal Components (11.2); Black and Litterman (12); Basic prob and Stats (19). The cell entries, listing the exercise numbers of each exam dealing with each subject, are not reproduced here.]

18 Appendix: Some matrix algebra

18.1 Definition of matrix

A matrix A is an n-rows, m-columns array of elements; the elements are indicated by a_{i,j} where the first index stands for the row and the second for the column. n and m are called the row and column dimensions (sometimes shortened to "the dimensions") or sizes of the matrix A.
Sometimes we write: A is an n×m matrix. Sometimes a matrix is indicated as A ≡ {a_{ij}}. When n = m we say the matrix is square. When the matrix is square and a_{ij} = a_{ji} we say the matrix is symmetric. When a matrix is made of just one row or one column it is called a row (column) vector.

18.2 Matrix operations

1. Transpose: A' = {a_{ji}}. A'' = A. If A is symmetric then A' = A.

2. Matrix sum. The sum of two matrices C = A + B is defined if and only if the dimensions of the two matrices are identical. In this case C has the same dimensions as A and B and c_{ij} = a_{ij} + b_{ij}. Clearly A + B = B + A and (A + B)' = A' + B'.

3. Matrix product. The product C = AB of two matrices n×m and q×k is defined if and only if m = q. If this is the case C is an n×k matrix and c_{ij} = Σ_l a_{il} b_{lj}. In the matrix case it may well be that AB is defined but BA is not. An important property is C' = B'A' or, which is the same, (AB)' = B'A'. Provided the products and sums involved in what follows are defined, we have (A + B)C = AC + BC.

18.3 Rank of a matrix

A row vector x is said to be linearly dependent on the row vectors of a matrix A if it is possible to find a row vector z such that x = zA. The same for a column vector. r(A) (or rank(A)), the rank of a matrix A, is defined as the number of linearly independent rows or (the number is the same) the number of linearly independent columns of A. A square matrix A of size n is called non singular if r(A) = n. If B is any n × k matrix, then r(AB) ≤ min(r(A), r(B)). If B is an n × k matrix of rank n, then r(AB) = r(A). If C is an l × n matrix of rank n, then r(CA) = r(A).

18.4 Some special matrices

1. A square matrix A with elements a_{ij} = 0 for i ≠ j is called a diagonal matrix.

2. A diagonal matrix with ones on the diagonal is called the identity and is indicated with I. IA = A and AI = A (if the product is defined).

3. A matrix which solves the equation AA = A is called idempotent.
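The transpose, product and rank rules above are easy to check numerically. A minimal sketch using NumPy (the matrices are made up for illustration):

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 3.]])        # a 2x3 matrix
B = np.array([[1., 0.],
              [2., 1.],
              [0., 4.]])            # a 3x2 matrix, so AB is defined

C = A @ B                           # the 2x2 product

# (AB)' = B'A'
print(np.allclose(C.T, B.T @ A.T))

# r(AB) <= min(r(A), r(B))
print(np.linalg.matrix_rank(C) <= min(np.linalg.matrix_rank(A),
                                      np.linalg.matrix_rank(B)))
```

Note that here both AB and BA are defined, but with different dimensions (2×2 and 3×3): the matrix product is not commutative.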
18.5 Determinants and Inverse

There are several alternative definitions for the determinant of a square matrix. The Leibniz formula for the determinant of an n × n matrix A is

det(A) = |A| = Σ_{σ∈S_n} sgn(σ) Π_{i=1}^n a_{i,σ_i}

Here the sum is computed over all permutations σ of the set {1, 2, ..., n} and sgn(σ) denotes the signature of σ: it is +1 for even σ and −1 for odd σ. Evenness or oddness can be defined as follows: the permutation is even (odd) if the new sequence can be obtained by an even (respectively odd) number of switches of elements in the set. The inverse of a square matrix A is the solution A^{-1} (or inv(A)) of the equations A^{-1}A = I = AA^{-1}. If A is invertible then (A')^{-1} = (A^{-1})'. The inverse of a square matrix A exists if and only if the matrix is non singular, that is, if the size and the rank of A are the same. A square matrix is non singular if and only if it has non null determinant, and det(A^{-1}) = 1/det(A). If the products and inversions in the following formula are well defined (that is, dimensions agree and the inverses exist), then (AB)^{-1} = B^{-1}A^{-1}. Inversion has to do with the solution of linear non homogeneous systems. Problem: find a column vector x such that Ax = b with A and b given. If A is square and invertible then the unique solution is x = A^{-1}b. If A is n × k with n > k but r(A) = k, then the system Ax = b has in general no exact solution; however the system A'Ax = A'b has the solution x = (A'A)^{-1}A'b.

18.6 Quadratic forms

A quadratic form with coefficient matrix given by the symmetric matrix A and variables vector given by the column vector x (with the size of A equal to the number of rows of x) is the scalar given by:

x'Ax = Σ_i Σ_j a_{ij} x_i x_j

A symmetric matrix A is called semi positive definite (psd) if and only if x'Ax ≥ 0 for all x. It is called positive definite if and only if x'Ax > 0 for all non null x. If a matrix A can be written as A = C'C for some matrix C then A is surely at least psd.
In fact x'Ax = x'C'Cx, and this is the product of the row vector x'C' times its own transpose Cx, hence a sum of squares, and this cannot be negative. It is also possible to show that any psd matrix can be written as C'C for some C.

18.7 Random Vectors and Matrices

(See the following appendix for more details.) A random vector (resp. matrix) is simply a vector (matrix) whose elements are random variables.

18.8 Functions of Random Vectors (or Matrices)

• A function of a random vector (matrix) is simply a vector (or scalar) function of the components of the random vector (matrix).

• Simple examples are: the sum of the elements of the vector, the determinant of a random matrix, sums or products of matrices and vectors and so on.

• We shall be interested in functions of the vector (matrix) X of the kind Y = A + BXC where A, B and C are non stochastic matrices of dimensions such that the sum and the products in the formula are well defined.

• A quadratic form x'Ax with a non stochastic coefficient matrix A and a stochastic vector x is an example of a non linear, scalar function of a random vector.

18.9 Expected Values of Random Vectors

• These are simply the vectors (matrices) containing the expected values of each element in the random vector (matrix).

• E(X') = E(X)'.

• An important result, which generalizes the linearity property of the scalar version of the operator E(.) to the general linear function defined above, is: E(A + BXC) = A + BE(X)C.

18.10 Variance Covariance Matrix

• For random column vectors, and here we mean vectors only, we define the variance covariance matrix of a column vector X as:

V(X) = E(XX') − E(X)E(X') = E((X − E(X))(X − E(X))')

• The Varcov matrix is symmetric; on the diagonal we have the variances (V(X_i) = σ²_{X_i}) of each element of the vector, while in the upper and lower triangles we have the covariances (Cov(X_i; X_j)).
• The most relevant property of this operator is: V(A + BX) = BV(X)B'.

• From this property we deduce that varcov matrices are always (semi) positive definite. In fact if A = V(z) and x is a (non random) column vector of the same size as z, then V(x'z) = x'Ax, which cannot be negative for any possible x.

18.11 Correlation Coefficient

• The correlation coefficient between two random variables is defined as:

ρ_{X_i;X_j} = Cov(X_i; X_j) / (σ_{X_i} σ_{X_j})

The correlation matrix ρ(x) of the vector x of random variables is simply the matrix of correlation coefficients or, which is the same, the Varcov matrix of the vector of standardized X_i.

• The presence of a zero correlation between two random variables is sometimes called linear independence or orthogonality. The reader should be careful using these terms as they exist also in the setting of linear algebra but their meaning, even if connected, is slightly different. Stochastic independence implies zero correlation; the reverse proposition is not true.

18.12 Derivatives of linear functions and quadratic forms

Often we must compute derivatives of functions of the kind x'Ax (a quadratic form) or x'q (a linear combination of the elements of the vector q with weights x) with respect to the vector x. In both cases we are considering a (column) vector of derivatives of a scalar function w.r.t. a (column) vector of variables (commonly called a "gradient"). There is a useful matrix notation for such derivatives which, in these two cases, is simply given by:

∂x'Ax/∂x = 2Ax   and   ∂x'q/∂x = q

The proof of these two formulas is quite simple. In both cases we give a proof for a generic element k of the derivative column vector. For the linear combination we have

x'q = Σ_j x_j q_j   so that   ∂x'q/∂x_k = q_k

For the quadratic form

x'Ax = Σ_i Σ_j x_i x_j a_{i,j}

∂(Σ_i Σ_j x_i x_j a_{i,j})/∂x_k = Σ_{j≠k} x_j a_{k,j} + Σ_{i≠k} x_i a_{i,k} + 2x_k a_{k,k} = Σ_{j≠k} x_j a_{k,j} + Σ_{j≠k} x_j a_{k,j} + 2x_k a_{k,k} = 2A_{k,·}x

where A_{k,·}
means the k-th row of A, and we used the fact that A is a symmetric matrix. An important point to stress is that the derivative of a function with respect to a vector always has the same dimension as the vector w.r.t. which the derivative is taken, in this case x. So, for instance,

∂x'Ax/∂x = 2Ax   and not   ∂x'Ax/∂x = 2x'A

(remember that A is symmetric).

18.13 Minimization of a PD quadratic form, approximate solution of over determined linear systems

Now let us go back to the linear system Ax = b with A an n × k matrix of rank k. If n > k this system has, in general, no solution. However, let us try to solve a similar problem. By solving the system we wish for Ax − b = 0; in our case this is not possible, so let us change the problem to min_x (Ax − b)'(Ax − b). In words: try to minimize the sum of squared differences between Ax and b if you cannot make it equal to 0. We have

(Ax − b)'(Ax − b) = x'A'Ax + b'b − 2b'Ax

Now let us take the derivative of this w.r.t. x:

∂(x'A'Ax + b'b − 2b'Ax)/∂x = 2A'Ax − 2A'b

(remember the rule about the size of a derivatives vector). We now create a new linear system by equating these derivatives to 0:

A'Ax = A'b

And the solution is

x = (A'A)^{-1}A'b

This is the "least squares" approximate solution of an (over determined) linear system (see the Appendix on least squares and the Gauss Markov model).

18.14 Minimization of a PD quadratic form under constraints. Simple applications to Finance

Suppose we are given a column vector r where r_j is the random (linear) return of stock j. Suppose we hold these stocks in a portfolio for one time period and that the (known) relative amount of each stock in our portfolio is given by the column vector w such that 1'w = 1, where 1 indicates a column vector of ones of the same size as w. Then the random linear return of the portfolio over the same time period is given by r_π = w'r. Since w is known we have E(w'r) = w'E(r) and V(w'r) = w'V(r)w.
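Since w is known and non random, the two formulas E(w'r) = w'E(r) and V(w'r) = w'V(r)w are just matrix computations. A small sketch with three hypothetical assets (the expected returns and the covariance matrix are made-up numbers, used only for illustration):

```python
import numpy as np

mu = np.array([0.05, 0.08, 0.03])        # hypothetical E(r)
Sigma = np.array([[0.04, 0.01, 0.00],    # hypothetical V(r)
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.01]])
w = np.array([0.5, 0.3, 0.2])            # portfolio weights, 1'w = 1

port_mean = w @ mu                       # E(w'r) = w'E(r)
port_var = w @ Sigma @ w                 # V(w'r) = w'V(r)w
print(port_mean, port_var)
```

Only the first two moments of the single returns and the weight vector enter these computations, which is exactly what makes mean variance optimization feasible.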
The fact that, over one period of time, the expected linear return and the variance of the linear return of a portfolio only depend on the expected values and the covariance matrix of the single returns and on the weight vector is what allows us to implement a simple optimization method. For the moment let us suppose that the problem is

min_{w: 1'w=1} w'V(r)w

In this problem we want to minimize a quadratic form under a linear constraint. It is to be noticed that, without the constraint, the problem would be solved by w = 0 (no investment). The constraint does not allow for this. Such problems can be solved with the Lagrange multiplier method. The idea is to artificially express, in a single function, both the need of minimizing the original function and the need to do this respecting the constraint 1'w = 1. In order to do this we define the Lagrangian of the problem, given by

L(w, λ) = w'V(r)w − 2λ(1'w − 1)

In this function the value of the unconstrained objective function is summed with the value of the constraint multiplied by a dummy parameter 2λ. We now take the derivatives of the Lagrangian w.r.t. w and λ:

∂(w'V(r)w − 2λ(1'w − 1))/∂w = 2V(r)w − 2λ1

∂(w'V(r)w − 2λ(1'w − 1))/∂λ = −2(1'w − 1)

If we set both these to zero we get, supposing V(r) invertible,

V(r)w = λ1
1'w = 1

Notice the difference between the 1-s. In the first equation 1 is a column vector, which is required because we cannot equate a vector to a scalar. The same for 1' in the second equation, where the r.h.s. is the scalar one (for dimension compatibility with the l.h.s.). We do not stress this using, e.g., boldface for the vector 1 because the meaning follows unambiguously from the context. It is clear that the second equation is satisfied if and only if w satisfies the constraint. What is the meaning of the first equation (or, better, set of equations)?
The unconstrained equation would have been V(r)w = 0, whose only solution (due to the fact that V(r) is invertible) would be w = 0. But this solution does not satisfy the constraint. What we shall be able to get is V(r)w = λ1 for some λ chosen in such a way that the constraint is satisfied. To find this λ, simply put together the result of the first set of equations, w = λV(r)^{-1}1, and the equation expressing the constraint, 1'w = 1. Both equations are satisfied if and only if

λ = 1/(1'V(r)^{-1}1)

We now know λ, that is, we know by exactly how much we must violate the unconstrained optimization condition (first set of equations) in order to satisfy the constraint (second equation). In the end, putting this value of λ in the solution for the first set of equations, we get

w = V(r)^{-1}1 / (1'V(r)^{-1}1)

It is to be noticed that these are only necessary conditions but, for our purposes, this is enough. What we got is the one period "minimum variance portfolio" made of securities whose returns covariance matrix is V(r). What is the variance of this portfolio?

V(w'r) = w'V(r)w = (1'V(r)^{-1} V(r) V(r)^{-1}1) / (1'V(r)^{-1}1)² = (1'V(r)^{-1}1) / (1'V(r)^{-1}1)² = 1/(1'V(r)^{-1}1)

The expected value shall be

E(w'r) = w'E(r) = (1'V(r)^{-1}E(r)) / (1'V(r)^{-1}1)

If V(r) is only spd, then it shall not be invertible, so that the system V(r)w = λ1 cannot be solved by simple inversion. In this case, however, there shall exist non null vectors w* such that w*'V(r)w* = 0 and, using such w*, it shall be possible to build portfolios of the securities with (linear) return vector r, and maybe the risk free, such that the return of these portfolios is risk free (zero variance) even if their components are risky. Such riskless return must be equal to the risk free rate for no arbitrage to hold.

18.15 The linear model in matrix notation

Suppose you have a matrix X of dimensions n × k containing n observations on each of k variables.
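Before developing the linear model, the closed form solution of the minimum variance problem derived above can be computed in a few lines; a sketch, with a made-up positive definite covariance matrix standing in for V(r):

```python
import numpy as np

Sigma = np.array([[0.04, 0.01, 0.00],    # hypothetical V(r), positive definite
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.01]])
ones = np.ones(3)

Sinv1 = np.linalg.solve(Sigma, ones)     # V(r)^{-1} 1
w = Sinv1 / (ones @ Sinv1)               # minimum variance weights
min_var = 1.0 / (ones @ Sinv1)           # predicted variance 1/(1'V(r)^{-1}1)

print(w)                                 # the weights sum to one
print(w @ Sigma @ w, min_var)            # the two variance expressions agree
```

As a quick sanity check, any other weight vector summing to one (for instance equal weights) gives a variance at least as large as min_var.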
You also have an n × 1 vector y containing n observations on another variable. You would like to approximate y with a linear function of X, that is, with Xb for some k × 1 vector b. In general, if n > k, it shall not be possible to exactly fit Xb to y, so that the approximation shall imply a vector of errors ε = y − Xb. You would like to minimize ε, but this is a vector: we must define some scalar function of it we wish to minimize. A possible solution is ε'ε, that is, the sum of squares of the errors. We then wish to minimize

ε'ε = (y − Xb)'(y − Xb) = y'y + b'X'Xb − 2y'Xb

If we take the derivative of this w.r.t. b we get

∂(y'y + b'X'Xb − 2y'Xb)/∂b = 2X'Xb − 2X'y

(again remember the size rule and remember that y'Xb = b'X'y: each is the transpose of the other but both are scalars). The solution of this is

b = (X'X)^{-1}X'y

This simple application of the rule for the approximate solution of an over determined system yields the most famous formula in applied (multivariate) Statistics. When this problem, for the moment just a best fit problem, shall be immersed in the appropriate statistical setting, our b shall become the Ordinary Least Squares parameter vector and shall be of paramount relevance in a wide range of applications to Economics and Finance.

19 Appendix: What you cannot ignore about Probability and Statistics

A quick check

The following simple example deals with the relations and differences between probability concepts and statistical concepts. Let us start from two simple concepts: the mean and the expected value. You know that an expected value has to do with a probability model: you cannot compute it if you do not know the possible values of a random variable and their probabilities. On the other hand an average, or mean, is a simpler concept involving just a set of numerical values: you take the sum of the values and divide by their number. Sometimes, if certain assumptions hold (e.g.
iid data), an expected value can be estimated using a mean computed over a given dataset. Moreover, when a mean is seen not as an actual number, involving the sum of actually observed data divided by the number of summands, but as a sum of still unobserved, hence random, data, divided by their number, a mean becomes a random quantity, being a function of random variables; it thus enters the field of Probability and needs a probability model for its description. As such, it is reasonable to ask for its probability distribution, expected value and variance. In fact this is the study of "sampling variability" for an estimate. On the contrary, probability distribution, expected value and variance have no interesting meaning for a mean of a given set of numbers, which has one and only one possible value. This dualism between quantities computed on numbers and functions of random variables holds for all other statistical quantities. In the case of the mean/average vs the expected value, we use (but not always) different names to stress the different roles of the objects we speak of. The same is done (usually) when we distinguish "frequency" and "probability". To apply this to each statistical concept would be a little cumbersome and, in fact, is not done in most cases. A variance is called a variance both when used in the probability setting and as a computation on numbers; the same for moments, covariances etc. Even the word "mean" is often used to indicate both expected values and averages. This is a useful shortcut but should not trick us into believing that the use of the same name implies identity of properties. Care must be used. In the experience of any teacher of Statistics, the potential misunderstandings which can derive from an incomplete understanding of this basic point are at the origin of most of the problems students run into when confronted with statistical concepts.
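Before moving to the example, a quick numerical aside on the two matrix formulas derived above: the minimum variance weights w = V(r)⁻¹1/(1′V(r)⁻¹1) and the least squares vector b = (X′X)⁻¹X′y. The sketch below uses NumPy with made-up numbers (the covariance matrix and the regression data are purely illustrative):

```python
import numpy as np

# --- Minimum variance portfolio (illustrative 3-asset covariance matrix) ---
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
ones = np.ones(3)
Vinv1 = np.linalg.solve(V, ones)        # V(r)^{-1} 1, without forming the inverse
w = Vinv1 / (ones @ Vinv1)              # minimum variance weights
assert abs(w.sum() - 1) < 1e-12         # the constraint 1'w = 1 holds
# portfolio variance equals 1 / (1' V(r)^{-1} 1)
print(w @ V @ w, 1 / (ones @ Vinv1))

# --- Ordinary least squares on random illustrative data ---
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
b = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X'X)^{-1} X'y
e = y - X @ b                           # error (residual) vector
print(X.T @ e)                          # numerically the zero vector
```

At the minimum the residuals are orthogonal to the columns of X: this is exactly the first order condition X′Xb = X′y.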
Consider the following example and, even if you judge it trivial, dedicate some time to really repeat and understand all its steps. Suppose you observe the numbers 1, 0, 1. The mean of these is, obviously, 2/3. Is it meaningful to ask questions about the expected value of each of these three numbers or of the mean? Not at all, except in the very trivial case where the answer to this question coincides with the actual observed numbers. However, in most relevant cases the numbers we may observe are not predetermined. They are obviously known after we observe data, but it is usually the case that we also want to say something about their values in future possible observations (e.g. we must decide about taking some line of action whose actual result depends on the future values of observables; this is the basic setting in financial investments). We cannot do this without the proper language: we need a model, written in the language of Probability, able to describe the "future possible observations". For instance, we could think it sensible to assume that each single number we observe can only be either 0 or 1, that each possible observation has the same probability distribution for the possible results, P for 1 and 1 − P for 0, and that observations are independent, that is: the probability of each possible string of results is nothing but the product of the probabilities of each result in the string. Since we only know that P is a number between 0 and 1, the mean computed above using data from the phenomenon thus modeled (in this case equivalent to the "relative frequency" of 1) has a new role: it could be useful as an "estimate" of P. Under our hypotheses, however, it is clear that the value 2/3 is only the value of our mean for the observed data; it is NOT the value of P, which is still an unknown constant. We need something connecting the two. The first step is to consider the possible values that the mean could have had on other possible "samples" of three observations.
By enumeration these are 0, 1/3, 2/3, 1. We can also compute, under our hypotheses, the probabilities of these values. Since a mean of 1 can happen only when we observe three ones, and since the three results are independent and with the same probability P, we have that the probability of observing a mean of 1 is P·P·P = P³. On the other hand, a mean of 0 can only be observed when we only observe zeroes, that is, with probability (1 − P)³. A mean of 1/3 can be obtained if we observe a 1 and two 0s. There are three possibilities for this: 1,0,0; 0,1,0 and 0,0,1. The respective probabilities (under our hypotheses) are P(1 − P)(1 − P); (1 − P)P(1 − P) and (1 − P)(1 − P)P. The three possibilities exclude each other so we can sum up the probabilities. In the end we have 3P(1 − P)². The same reasoning gives us the probability of observing a mean of 2/3, that is: 3P²(1 − P). What we just did is to compute the "sampling distribution" of the mean, seen as an "operator" which we can apply to any possible sample. This sampling distribution of the mean gives us all its (four) possible values (on n = 3 samples) and their probabilities as functions of P. Since we now have both the possible values and their probabilities, we can compute the expected value and the variance of this mean. This is the second step to take in order to connect the estimate to the "parameter" P. These computations shall give us information about how good an "estimate" of the unknown P the mean can be. We would like the mean to have expected value P (unbiasedness) and as small a variance as possible, so as to be "with high probability" "near" to the true but unknown value of P. What this last sentence means is, simply, that the probability of observing samples where the mean has a value near P should be big. Formally, an expected value is very similar to a mean, with the difference that each value of the (now) random variable is multiplied by its probability and not its frequency.
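The enumeration just described is easy to automate. A small sketch (P = 1/3 is an arbitrary illustrative value of the unknown parameter; exact fractions avoid rounding):

```python
from fractions import Fraction
from itertools import product

P = Fraction(1, 3)   # illustrative value of the unknown parameter

# Enumerate all 2^3 possible samples of three 0/1 observations and
# accumulate the probability of each possible value of the sample mean.
dist = {}
for sample in product([0, 1], repeat=3):
    prob = Fraction(1)
    for x in sample:
        prob *= P if x == 1 else 1 - P
    mean = Fraction(sum(sample), 3)
    dist[mean] = dist.get(mean, Fraction(0)) + prob

for m in sorted(dist):
    print(m, dist[m])
```

The four probabilities reproduce (1 − P)³, 3P(1 − P)², 3P²(1 − P) and P³, and they sum to one.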
The expected value shall be:

0·(1 − P)³ + (1/3)·3P(1 − P)² + (2/3)·3P²(1 − P) + 1·P³ = P

Notice the difference with respect to a mean computed on a given sample and be careful not to mistake the point. The difference is not that the result is not a number but an unknown "parameter". It could well be that P is known and, say, equal to 1/3, so that the result would be a number. The difference is that this quantity, the expected value of the sample mean, is a probability quantity: it has nothing to do with actual observations and frequencies and everything to do with potential observations and probability. In fact, on each given sample we have a given value of the mean, so its expected value has a meaning only because we consider the variability of the values of this mean on the POSSIBLE samples. However, the result is very useful both if P is unknown and if it is known. When P is unknown it tells us that the mean of the observed data shall be unbiased as an estimate for P. When P is known to be, say, 1/3 it shall give P an "empirical connection" to an observable quantity, by assessing that the expected value of the mean of the observed data shall be 1/3. The question, however, is: OK, this for the expected value of the sample mean. But: how probable is it that the actual observed mean be "near" P? Well, suppose for instance P = 1/3: we immediately see that the probability of observing a mean equal to 1 is 1/27, while the probability of observing a mean between 0 and 2/3 is 1 − 1/27, very near to 1. However this computation is quite cumbersome and difficult to make for samples bigger than our 3 observations. A more useful answer to this question requires the computation of the (sampling) variance of our mean.
By using the definition of variance and the probabilities already computed we get, for the variance of the mean:

P(1 − P)/3

The general case, for a sample of size not 3 but n, shall be:

P(1 − P)/n

Clearly, the bigger n, the smaller the variance. This, again, for unknown P, is an unknown number. However we can say much about its value. In fact, since P is between 0 and 1, P(1 − P) has a maximum value of 1/4 (and this is the exact value when P = 1/2). How is this connected with the probability of observing a mean "near" to P? The answer is given by the Tchebicev inequality. This says that for any random variable X (hence also for the sample mean) we have:

Prob(E(X) − k√V(X) < X < E(X) + k√V(X)) ≥ 1 − 1/k²

(for any positive k). This implies, for instance, that if P = 1/3 and n = 3 (so that √V(X) = √(2/27) ≈ .2722), there is at least a probability of .75 that the sample mean be observed between the values 1/3 − .5443 and 1/3 + .5443. This is already very useful information, but think what happens when the sample size is not 3 but, say, 100. In this case the above interval becomes much more narrow: 1/3 − .0943, 1/3 + .0943. Even if P is unknown, with n = 100 the interval for at least a probability of .75 shall never be wider than ±.1 (= ±2√(1/(4·100))) around the "true" P. Results like the central limit theorem allow us to be even more precise, but this is outside the scope of this exercise. Beyond the numbers, what this boils down to is that, by studying the sample mean as a random variable (random due to the fact that the sample values are random before observing them), we are able to connect a parameter of the model, P, to an observable quantity, the sample mean. Conversely, we also understand the empirical role of P in determining the probability of different possible observations of the sample mean: a bigger P implies a higher probability of observing a bigger mean.
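A quick numerical companion to the last paragraphs (a sketch; the sample sizes 3 and 100 are the ones used in the text):

```python
import math
from fractions import Fraction

def mean_variance(P, n):
    """Variance of the sample mean of n iid Bernoulli(P) draws: P(1-P)/n."""
    return P * (1 - P) / n

P = Fraction(1, 3)
for n in (3, 100):
    v = mean_variance(P, n)
    # Tchebicev with k = 2: probability at least 0.75 inside +/- half_width
    half_width = 2 * math.sqrt(v)
    print(n, v, round(half_width, 4))

# For unknown P we still know P(1-P) <= 1/4, so with n = 100 the k = 2
# interval is never wider than +/- 0.1 around the true P.
print(2 * math.sqrt(1 / (4 * 100)))
```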
These are simple instances of two basic points in Statistics as applied to any science: we call the first "(statistical) inference" (transforming information on observed frequencies into information on probabilities) and the second "forecasting" (assessing the probabilities of observations still to be made). This longish example is not intended to teach you any new concept: with the possible exception of the Tchebicev inequality, all this should be already known after a BA in Economics. You can take it as follows: if you see all the concepts and steps in the example as clear, even trivial, fine! All that follows in this course shall be quite easy. On the other hand, if any step seems fuzzy or inconsequential, dedicate some more time to a quick rehearsal of what you already did during the BA concerning Probability and Statistics. And for any problem ask your teachers.

How should you use what follows?

In the following paragraphs you shall find a quick summary of basic Probability and Statistics concepts. A good understanding of basic concepts in modern Finance and Economics as applied to the fields of asset pricing, asset management, risk management and Corporate Finance (that is: what you do in the two-year master) would require a full knowledge of what follows. As far as this course is concerned, a good understanding of the strictly required Statistics and Probability concepts (really basic ones, in fact!) can be derived by simply examining the questions asked in past exams. Moreover, before section 1 and section 6 of these handouts you can find a short list of concepts that are essential to the understanding of the first and the second part of these handouts. In what follows, a small number of less essential points, preceded by an asterisk, can be left out. The following summary is (obviously) NOT an attempt to write an introductory text of Probability and Statistics.
It should be used as a quick summary check: browse through it and see if most of the concepts are familiar. In the unlikely case the answer is no (this could be the case for students coming from BAs in different fields), you should dedicate some time to upgrading your basic notions of Probability and Statistics. For any problem and suggestion ask your teachers.

Probability

19.1 Probability: a Language

• Probability is a language for building decision models.
• As all languages, it does not offer or guarantee ready made splendid works of art (that is: right decisions) but simply a grammar and a syntax whose purpose is avoiding inconsistencies. We call this grammar and this syntax "Probability calculus".
• On the other hand, any language makes it simple to "say" something, difficult to say something else, and there are concepts that cannot even be thought in any given language. So, no analysis of what we write in a language is independent of the structure of the language itself. And this is true for Probability too.
• The language is useful to deduce probabilities of certain events when other probabilities are given, but the language itself tells us nothing about how to choose such probabilities.

19.2 Interpretations of Probability

• A lot of (often quite cheap) philosophy on the empirical meaning of probability boils down to two very weak suggestions:
• For results of replicable experiments, it may be that probability assessments have to do with long run (meaning what?) frequency;
• For more general uncertainty situations, probability assessments may have something to do with prices paid for bets, provided you are not directly involved in the result of the bet, except with regard to a very small sum of money.
• In simple situations, where some symmetry statement is possible, as in the standard setting of "games of chance" where probability as a concept was born, the probability of relevant events can be reduced to some sum of probabilities of "elementary events" you may accept as "equiprobable".

19.3 Probability and Randomness

• Probability is, at least in its classical applications, introduced when we wish to model a collective "random" phenomenon, that is, an instance where we agree that something is happening "under constant conditions" and, notwithstanding this, the result is not fully determined by these conditions and is, a priori, unknown to us.
• Traders are interested in returns from securities, actuaries in mortality rates, physicists in describing gases or subatomic particles, gamblers in assessing the outcomes of a given gamble.
• At different degrees of confidence, students in these fields would admit that, in principle, it could be possible to attempt a specific modeling for each instance of the phenomena they observe but that, in practice, such a model would require such impossible precision in the measurement of initial conditions and parameters as to be useless. Moreover, computations for solving such models would be unwieldy even in simple cases.
• For these reasons, students in these fields are satisfied with a theory that avoids a case by case description, but directly models possible frequency distributions for collectives of observations and uses the probability language for these models.

19.4 Different Fields: Physics

• Quantum Physics seems the only field where the "in principle" clause is usually not considered valid.
• In Statistical Physics a similar attitude is held, but for a different reason. Statistical Physics describes pressure as the result of "random" hits of gas molecules on the surface of a container.
In doing this they refrain from using the standard arguments of single particle mechanics, not because this would be in principle impossible but because the resulting model would be in practice useless (for instance, its solution would depend on a precise measurement of the position and momentum of each gas molecule, something impossible to accomplish in practice).

19.5 Finance

• Finance people would admit that days are by no means the same and that prices are not due to "luck" but to a very complex interplay of news, opinions, sentiments etc. However, they admit that modeling this with useful precision is impossible and that, at a first level of approximation, days can be seen as similar, so that it is interesting to be able to "forecast" the frequency distribution of returns over a sizable set of days.
• The attitude is similar to Statistical Physics where, however, hypotheses of homogeneity of underlying micro behaviours are easier to sustain. Moreover, while we could model in an exact way a few particles, we cannot do the same even with a single human agent.

19.6 Other fields

• Actuaries do not try to forecast with ad hoc models the lifespan of this or that insured person (while they condition their models on some relevant characteristics like age, sex, smoker/non smoker and so on); they are satisfied with a (conditional) modeling of the distribution of lifespans in a big population and with matching this with their insured population.
• Gamblers compute probabilities, and sometimes collect frequencies. They would like to be able to forecast each single result but their problem, when the result depends on some physical randomizing device (roulette, die, coin, shuffled deck of cards etc.), is exactly the same as the physicist's problem, at least when the gamble result depends on the physics of the randomizing device.
• Very different and much more interesting is the case of betting (horse racing, football matches, political elections etc.).
In this case the repeatability of events under similar conditions cannot be called in as a justification for the use of probabilities, and this implies a different and interesting interpretation of probability which is beyond the scope of this summary.

• Weather forecasters, as all sensible forecasters (as opposed to foretellers), phrase their statements in terms of probabilities of basic events (sunny day, rain, thunderstorms, floods, snow, etc.). In countries where this is routinely done and weather forecasts are actually made in terms of probabilities (as in the UK and USA but not frequently in Italy), over time the meaning of, say, "60% probability of rain" and the usefulness of the concept have come to be understood by the general public (probability is not and should not be a mathematicians-only concept).
• Risk managers in any field (the financial one is a very recent example) aim at controlling the probability of adverse events.
• Any big general store chain must determine the procedures for replenishing its inventories given a randomly varying demand. This problem is routinely solved by probability models.
• A similar problem (and similar solutions) is encountered when the (random) demand and the (less random) supply of energy must be matched in a given energy grid; channels must be allocated in a communication network; turnpikes must be opened or closed to control traffic, etc.
• These are just examples of the applied fields where probability models and Statistics are applied with success to the solution of practical problems of paramount relevance.

19.7 Wrong Models

• As we already saw, in a sense all probability models are "wrong". With the exception (perhaps) of Quantum Mechanics, they do not describe the behaviour of each observable instance of a phenomenon but try, with the use of the non empirical concept of probability, to directly and at the same time fuzzily describe aggregate results: collective events.
• For this simple reason they are useful inasmuch as the decision payout depends, in some sense, on collectives of events.
• They are not useful for predicting the result of the next coin toss but they are useful for describing coin tossING.

19.8 Meaning of Correct

• A good use of probability only guarantees the model to be self consistent. It cannot guarantee it to be successful.
• When the term "correct" is applied to a probability model (it would be better to call it "satisfactory") what is usually meant is that its probability statements are well matched by empirical frequencies (the term "well calibrated" is also used in this sense).
• Sometimes, probability models are used in cases where the relevant event shall happen only one or few times.
• In this case the model shall be useful more for organizing our decision process than for describing its outcome. "Correct" in this case means: "a good and consistent summary of our opinions".

19.9 Events and Sets

• Probabilities are stated for "events", which are propositions concerning facts whose value of truth can reasonably be assessed at a given future time. However, formally, probabilities are numbers associated with sets of points.
• Points represent "atomic" verifiable propositions which, at least for the purposes of the analysis at hand, shall not be derived from simpler verifiable propositions.
• Sets of such points simply represent propositions which are true any time any one of the (atomic) propositions within each set is true.
• Notice that, while points must always be defined, it may well be the case that we only deal with sets of these and that some or all of these points, while elements of such sets, are not considered as sets by themselves. For instance, in rolling a standard die we have 6 possible "atomic" results, but we could be interested only in the probability of non atomic events like "the result is an even number" or "the result is bigger than 3".
Since probabilities shall be assigned to a chosen class of sets of points, and we shall call these sets "events", it may well be that these "events" do not include the atomic propositions (which in common language would graduate to the name "event").

• Sets of points are indicated by capital letters: A, B, C, .... The "universe" set (representing the sure event) is indicated with Ω and the empty set (the impossible event) with ∅.
• Finite or enumerably infinite collections of sets are usually indicated with {Ai}, i = 1, ..., n, and {Ai}, i = 1, 2, ....
• Correct use of basic Probability requires the knowledge of the basic set theoretical operations: A ∩ B (intersection), A ∪ B (union), A \ B (difference), Ā (negation) and their basic properties. The same is true for finite and enumerably infinite unions and intersections: ∪_{i=1...n} Ai, ∪_{i=1...∞} Ai and so on.

19.10 Classes of Events

• Probabilities are assigned to events (sets) in classes of events which are usually assumed closed with regard to some set operations.
• The basic class is an algebra, usually indicated with an uppercase calligraphic letter: A. An algebra is a class of sets which includes Ω and is closed under finite intersection and negation of its elements, that is: if two sets are in the class, their intersection and their negations are also in the class. This implies that the finite union is also in the class and so is the difference (why?).
• When the class of sets contains more than a finite number of sets, usually also enumerably infinite unions of sets in the class are required to be sets in the class itself (and so enumerable intersections, why?). In this case the class is called a σ-algebra. The name "event" is from now on used to indicate a set in an algebra or σ-algebra.
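The die example above can be replayed with Python sets, which implement exactly the operations just listed (a minimal illustration):

```python
# The six "atomic" results of a standard die roll play the role of Omega
omega = {1, 2, 3, 4, 5, 6}

even = {2, 4, 6}        # "the result is an even number"
big = {4, 5, 6}         # "the result is bigger than 3"

print(even & big)       # intersection: {4, 6}
print(even | big)       # union
print(even - big)       # difference
print(omega - even)     # negation (complement with respect to Omega)
```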
19.11 Probability as a Set Function

• A probability is a set function P defined on the elements of an algebra such that: P(Ω) = 1, P(Ā) = 1 − P(A) and, for any finite number of disjoint events {Ai}, i = 1, ..., n (Ai ∩ Aj = ∅ ∀i ≠ j), we have: P(∪_{i=1...n} Ai) = Σ_{i=1...n} P(Ai).
• If the probability is defined on a σ-algebra we require the above additivity property to be valid also for enumerable unions of disjoint events.

19.12 Basic Results

• A basic result, implied by the above axioms, is that for any pair of events we have: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• Another basic result is that if we have a collection of disjoint events {Ai}, i = 1, ..., n (Ai ∩ Aj = ∅ ∀i ≠ j) and another event B such that B = ∪_{i=1...n}(Ai ∩ B), then we can write: P(B) = Σ_{i=1...n} P(B ∩ Ai).

19.13 Conditional Probability

• For any pair of events we may define the conditional probability of one to the other, say P(A|B), as a solution of the equation P(A|B)P(B) = P(A ∩ B).
• If we require, and we usually do, the conditioning event to have positive probability, P(B) ≠ 0, this solution is unique and we have: P(A|B) = P(A ∩ B)/P(B).

19.14 Bayes Theorem

Using the definition of conditional probability and the above two results we can prove Bayes theorem. Let {Ai}, i = 1, ..., n, be a partition of Ω in events, that is: Ai ∩ Aj = ∅ ∀i ≠ j and ∪_{i=1...n} Ai = Ω. We have:

P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1...n} P(B|Aj)P(Aj)

19.15 Stochastic Independence

• We say that two events are "independent in the probability sense", "stochastically independent" or, simply, when no misunderstandings are possible, "independent" if P(A ∩ B) = P(A)P(B).
• If we recall the definition of conditional probability, we see that, in this case, the conditional probability of each one of the events to the other is again the "marginal" probability of the same event.

19.16 Random Variables

• These are functions X(.) from Ω to the real axis R.
• Not all such functions are considered random variables. For X(.)
to be a random variable we require that for any real number t the set Bt given by the points ω in Ω such that X(ω) ≤ t is also an event, that is: an element of the algebra (or σ-algebra).
• The reason for this requirement (whose technical name is "measurability") is that a basic tool for modeling the probability of values of X is the "probability distribution function" (PDF) (sometimes "cumulative distribution function", CDF) of X, defined for all real numbers t as FX(t) = P({ω} : X(ω) ≤ t) = P(Bt) and, obviously, in order for this definition to have a meaning, we need all Bt to be events (that is: a probability P(Bt) must be assessed for each of them).

19.17 Properties of the PDF

• From its definition we can deduce some noticeable properties of FX:
1. it is a non decreasing function;
2. its limit for t going to −∞ is 0 and its limit for t going to +∞ is 1;
3. we have lim_{h↓0} FX(t + h) = FX(t), but this is in general not true for h ↑ 0, so that the function may be discontinuous.
• We may have at most an enumerable set of such discontinuities (as they are discontinuities of the first kind).
• Each of these discontinuities is to be understood as a probability mass concentrated on the value t where the discontinuity appears. Elsewhere F is continuous.

19.18 Density and Probability Function

• In order to specify probability models for random variables, usually, we do not directly specify F but other functions easier to manipulate.
• We usually consider as most relevant two cases (while interesting mixes of these may appear):
1. the absolutely continuous case, that is, where F shows no discontinuity and can be differentiated with the possible exception of a set of isolated points;
2. the discrete case, where F only increases by jumps.

19.19 Density

In the absolutely continuous case we define the probability density function of X as fX(t) = ∂FX(s)/∂s |_{s=t} where this derivative exists, and we complete this function in an arbitrary way where it does not.
Any choice of completion shall have the property:

FX(t) = ∫_{−∞}^{t} fX(s)ds

19.20 Probability Function

In the discrete case we call "support" of X the at most enumerable set of values xi corresponding to discontinuities of F; we indicate this set with Supp(X) and define the probability function

PX(xi) = FX(xi) − lim_{h↑0} FX(xi + h) for all xi ∈ Supp(X)

with the agreement that this function is zero on all other real numbers. In simpler but less precise words, PX(.) is equal to the "jumps" of FX(.) at the points xi where these jumps happen and zero everywhere else.

19.21 Expected Value

The "expected value" of (in general) a function G(X) is then defined, in the continuous and discrete case, as

E(G) = ∫_{−∞}^{+∞} G(s)fX(s)ds

and

E(G) = Σ_{xi∈Supp(X)} G(xi)PX(xi)

If G is the identity function G(t) = t, the expected value of G is simply called the "expected value", "mathematical expectation", "mean", "average" of X.

19.22 Expected Value

• If G is a non-negative integer power, G(X) = X^k, we speak of "the k-th moment of X" and usually indicate this with mk or µk.
• If G(X) is the function I(X ∈ A) for a given set A, which is equal to 1 if X = x ∈ A and 0 otherwise (the indicator function of A), then E(G(X)) = P(X ∈ A).
• In general, when the probability distribution of X is NOT degenerate (concentrated on a single value x), E(G(X)) ≠ G(E(X)). There is a noticeable exception: if G(X) = aX + b with a and b constants. In this case we have E(aX + b) = aE(X) + b.
• Sometimes the expected value of X is indicated with µX or simply µ.

19.23 Variance

• The "variance" of G(X) is defined as V(G(X)) = E((G(X) − E(G(X)))²) = E(G(X)²) − E(G(X))².
• A noticeable property of the variance is that V(aG(X) + b) = a²V(G(X)).
• The square root of the variance is called "standard deviation". For these two quantities the symbols σ² and σ are often used (with or without the name of the variable as a subscript).
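For a concrete discrete instance of these definitions, take a fair die: Supp(X) = {1, ..., 6} with PX(xi) = 1/6. The sketch below just applies the formulas (exact fractions avoid rounding):

```python
from fractions import Fraction

support = range(1, 7)
p = Fraction(1, 6)                       # P_X(x) = 1/6 on the support

E = sum(x * p for x in support)          # expected value: 7/2
E2 = sum(x**2 * p for x in support)      # second moment
V = E2 - E**2                            # variance: E(X^2) - E(X)^2 = 35/12
print(E, V)

# The linear exception: E(aX + b) = a E(X) + b
assert sum((2 * x + 1) * p for x in support) == 2 * E + 1

# Indicator function: E(I(X in A)) = P(X in A)
A = {2, 4, 6}
assert sum(p for x in support if x in A) == Fraction(1, 2)
```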
19.24 Tchebicev Inequality

• A fundamental inequality which connects probabilities with means and variances is the so called "Tchebicev inequality":

P(|X − E(X)| < λσ) ≥ 1 − 1/λ²

• As an example: if λ is set to 2, the inequality gives a probability of at least 75% for X to be between its expected value + and − 2 times its standard deviation.
• Since the inequality is sharp, that is: it is possible to find a distribution for which the inequality becomes an equality, this implies that, for instance, 99% probability could require a ± "10σ" interval.
• For comparison, 99% of the probability of a Gaussian distribution is contained in the interval µ ± 2.576σ.
• These simple points have a great relevance when tail probabilities are computed in risk management applications.
• In popular literature about extreme risks, and also in some applied work, it is common to ask for a "six sigma" interval. For such an interval the Tchebicev bound is 97.(2)%.

19.25 *Vysochanskij–Petunin Inequality

The Tchebicev inequality can be refined by the Vysochanskij–Petunin inequality which, with the added hypothesis that the distribution be unimodal, states that, for any λ > 2/√3 = 1.632993:

P(|X − µ| < λσ) ≥ 1 − 4/(9λ²)

more than halving the probability outside the given interval given by Tchebicev: the 75% for λ = 2 now becomes 1 − 1/9, that is, 88.(9)%. Obviously, this gain in precision is huge only if λ is not too big. The fabled "six sigma" interval according to this inequality contains at least 98.76% probability, just about 1.5% more than Tchebicev.

19.26 *Gauss Inequality

This result is an extension of a result by Gauss, who stated that if m is the mode (mind: the mode, not the expected value; it is in this that the V–P inequality extends the result) of a unimodal random variable then

P(|X − m| < λτ) ≥ 1 − 4/(9λ²) if λ ≥ 2/√3
P(|X − m| < λτ) ≥ λ/√3 if 0 ≤ λ ≤ 2/√3

where τ² = E((X − m)²).
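The two bounds just quoted can be tabulated side by side (a sketch; remember that the V–P bound requires unimodality and λ > 2/√3):

```python
def tchebicev(lam):
    """Lower bound on P(|X - E(X)| < lam * sigma), any distribution."""
    return 1 - 1 / lam**2

def vysochanskij_petunin(lam):
    """Sharper lower bound, valid for unimodal X and lam > 2/sqrt(3)."""
    return 1 - 4 / (9 * lam**2)

for lam in (2, 6):
    print(lam, tchebicev(lam), vysochanskij_petunin(lam))
```

For λ = 6 the two printed values reproduce the 97.(2)% and 98.76% figures of the text.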
19.27 *Cantelli One Sided Inequality

A less well known but useful inequality is the Cantelli, or one sided Tchebicev, inequality which, phrased in a way useful for left tail sensitive risk managers, becomes:

P(X − µ ≥ λσ) ≥ λ²/(1 + λ²) (for λ < 0)

For λ = −2 this means that at least 4/5 of the probability (80%) is above the µ − 2σ lower boundary. For "minus six sigma" this becomes 36/37 = 97.(297)%.

19.28 Quantiles

• The "α-quantile" of X is defined as the value qα such that the following conditions are simultaneously valid:

Pr[X < qα] ≤ α
Pr[X ≤ qα] ≥ α

• Notice that in the case of a random variable with continuous FX(.) this could be written as qα ≡ inf(t : FX(t) = α) and, in the case of a continuous strictly increasing FX(.), this becomes qα ≡ t : FX(t) = α.
• For a non continuous FX(.), in case α is NOT one of the values taken by FX(.), the above definition corresponds to a value x of X with FX(x) > α.
• Due to applications in the definition of VaR it is more proper to use as quantile, in this case, a qα defined as the maximum value x of X with positive probability and with FX(x) ≤ α.
• The formal definition of this is rather cumbersome:

qα ≡ max{x : [Pr[X ≤ x] > Pr[X < x]] ∩ [Pr[X ≤ x] ≤ α]}

• which reads: "qα is the greatest value x such that it has a positive probability and such that FX(x) ≤ α".

19.29 Median

• If α = 0.5 we call the corresponding quantile the "median" of X and use for it, usually, the symbol Md.
• It may be interesting to notice that, if G is continuous and increasing, we have Md(G(X)) = G(Md(X)).

19.30 Mode

• A mode of a discrete probability distribution (or frequency distribution) is any value x ∈ Supp(X) where the probability (frequency) has a local maximum.
• "The mode", usually, is the global maximum.
• In the case of densities, the same definition is applied in terms of density instead of probability (frequency).
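The two defining conditions for qα are easy to implement for a discrete distribution. A sketch with a made-up three-point distribution (this implements the first, standard, definition, not the VaR variant):

```python
from fractions import Fraction

# Made-up discrete distribution: P(0) = 1/4, P(1) = 1/2, P(2) = 1/4
pf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def quantile(pf, alpha):
    """Smallest x with F_X(x) >= alpha, so that both Pr[X < x] <= alpha
    and Pr[X <= x] >= alpha hold."""
    F = Fraction(0)
    for x in sorted(pf):
        F += pf[x]
        if F >= alpha:
            return x
    return max(pf)

print(quantile(pf, Fraction(1, 2)))    # median: 1
print(quantile(pf, Fraction(9, 10)))   # 0.9-quantile: 2
```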
19.31 Univariate Distributions Models
• Models for univariate distributions come in two kinds: non parametric and parametric.
• A parametric model is a family of functions indexed by a finite set of parameters (real numbers) and such that, for any value of the parameters in a predefined parameter space, the functions are probability densities (continuous case) or probability functions (discrete case).
• A non parametric model is a model where the family of distributions cannot be indexed by a finite set of real numbers.
• It should be noticed that, in many applications, we are not interested in a full model of the distribution but in modeling only an aspect of it as, for instance, the expected value, the variance, some quantile and so on.

19.32 Some Univariate Discrete Distributions
• Bernoulli: P(x) = θ, x = 1; P(x) = 1 − θ, x = 0; 0 ≤ θ ≤ 1. You should notice the convention: the function is explicitly defined only on the support of the random variable. For the Bernoulli we have: E(X) = θ, V(X) = θ(1 − θ).
• Binomial: P(x) = C(n, x) θ^x (1 − θ)^(n−x), x = 0, 1, 2, ..., n; 0 ≤ θ ≤ 1, where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient. We have: E(X) = nθ; V(X) = nθ(1 − θ).
• Poisson: P(x) = λ^x e^(−λ)/x!, x = 0, 1, 2, ...; λ ≥ 0. We have: E(X) = λ; V(X) = λ.
• Geometric: P(x) = (1 − θ)^(x−1) θ, x = 1, 2, ...; 0 ≤ θ ≤ 1. We have: E(X) = 1/θ; V(X) = (1 − θ)/θ².

19.33 Some Univariate Continuous Distributions
Negative exponential: f(x) = θe^(−θx), x > 0, θ > 0. We have: E(X) = 1/θ; V(X) = 1/θ². (Here you should notice that, as is often the case for distributions with constrained support, the variance and the expected value are functionally related.)

19.34 Some Univariate Continuous Distributions
Gaussian: f(x) = (1/(√(2π)σ)) e^(−(x−µ)²/(2σ²)), x ∈ R, µ ∈ R, σ² > 0. We have E(X) = µ, V(X) = σ². A very important property of this random variable is that, if a and b are constants, then Y = aX + b is a Gaussian if X is a Gaussian.
By the above recalled rules on the E and V operators we have also that E(Y) = aµ + b; V(Y) = a²σ². In particular, the transform Z = (X − µ)/σ is distributed as a "standard" (expected value 0, variance 1) Gaussian.

19.35 Some Univariate Continuous Distributions
The distribution function of a standard Gaussian random variable is usually indicated with Φ, so Φ(x) is the probability of observing values of the random variable X which are smaller than or equal to the number x, in short: Φ(x) = P(X ≤ x). With z(1−α) = Φ⁻¹(1 − α) we indicate the inverse function of Φ, that is: the value of the standard Gaussian which leaves on its left a given amount of probability. Obviously Φ(Φ⁻¹(1 − α)) = 1 − α.

19.36 Random Vector
• A random vector X of size n is an n-dimensional vector function from Ω to Rⁿ, that is: a function which assigns to each ω ∈ Ω a vector of n real numbers.
• The name "random vector" is better than the name "vector of random variables" in that, while each element of a random vector is, in fact, a random variable, a simple vector of random variables could fail to be a random vector if the arguments ωi of the different random variables are not constrained to always coincide.
• (If you understand this apparently useless subtlety you are well on your road to understanding random vectors, random sequences and stochastic processes.)

19.37 Distribution Function for a Random Vector
• Notions of measurability analogous to those of the one dimensional case are required for random vectors but we do not mention these here.
• Just as in the case of random variables, we can define probability distribution functions for random vectors as
FX(t1, t2, ..., tn) = P({ω} : X1(ω) ≤ t1, X2(ω) ≤ t2, ..., Xn(ω) ≤ tn)
where the commas in this formula can be read as logical "and" and, please, notice again that the ω for each element of the vector is always the same.
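The standardization recalled above reduces any Gaussian probability to an evaluation of Φ; a sketch in plain Python, using the standard identity Φ(z) = (1 + erf(z/√2))/2:

```python
import math

def Phi(z):
    # distribution function of a standard Gaussian
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def gaussian_cdf(x, mu, sigma):
    # P(X <= x) for X Gaussian(mu, sigma^2), via the standardization Z = (X - mu)/sigma
    return Phi((x - mu) / sigma)

# e.g. P(X <= mu + 2.576 sigma) should be about 0.995, as quoted for the 99% interval
print(gaussian_cdf(2.576, 0.0, 1.0))
```

Note how the µ and σ of any Gaussian only enter through the standardized argument (x − µ)/σ.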
19.38 Density and Probability Function
As in the one dimensional case, we usually do not model a random vector by specifying its probability distribution function but its probability function P(x1, ..., xn) or its density f(x1, ..., xn), depending on the case.

19.39 Marginal Distributions
• In the case of random vectors we may be interested in "marginal" distributions, that is: probability or density functions of a subset of the original elements in the vector.
• If we wish to find the distribution of all the elements of the vector minus, say, the i-th element we simply work like this:
• in the discrete case
P(x1, ..., xi−1, xi+1, ..., xn) = Σ_{xi ∈ Supp(Xi)} P(x1, ..., xi−1, xi, xi+1, ..., xn)
• and in the continuous case:
f(x1, ..., xi−1, xi+1, ..., xn) = ∫_{Supp(Xi)} f(x1, ..., xi−1, xi, xi+1, ..., xn) dxi
• We iterate the same procedure for finding other marginal distributions.

19.40 Conditioning
• Conditional probability functions and conditional densities are defined just like conditional probabilities for events.
• Obviously, the definition should be justified in a rigorous way, but this is not necessary for now!
• The conditional probability function of, say, the first i elements in a random vector given the other n − i elements shall be defined as:
P(x1, ..., xi | xi+1, ..., xn) = P(x1, ..., xn) / P(xi+1, ..., xn)
• For the conditional density we have:
f(x1, ..., xi | xi+1, ..., xn) = f(x1, ..., xn) / f(xi+1, ..., xn)
• In both formulas we suppose the denominators to be non zero.

19.41 Stochastic Independence
• Two sub vectors of a random vector, say the first i and the other n − i random variables, are said to be stochastically independent if the joint distribution is the same as the product of the marginals or, which is the same under our definition, if the conditional and marginal distributions coincide.
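Marginalization and conditioning can be checked on a small discrete case; a sketch with a hypothetical joint probability table for two variables:

```python
# hypothetical joint probability function P(x, y) on a 2x3 support
joint = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
         (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20}

# marginal of Y: sum the joint probability over the values of X
marg_y = {}
for (x, y), p in joint.items():
    marg_y[y] = marg_y.get(y, 0.0) + p

# conditional probability function of X given Y = 1: joint / marginal
cond_x_given_y1 = {x: joint[(x, 1)] / marg_y[1] for x in (0, 1)}
print(marg_y, cond_x_given_y1)
```

The conditional probabilities sum to 1 by construction, since the marginal in the denominator is exactly the sum of the joint probabilities in the numerators.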
• We write this for the density case; for the probability function it is the same:
f(x1, ..., xn) = f(x1, ..., xi) f(xi+1, ..., xn)
f(x1, ..., xi | xi+1, ..., xn) = f(x1, ..., xi)
• This must be true for all the possible values of the n elements of the vector.

19.42 Mutual Independence
• A relevant particular case is that of a vector of mutually independent (or simply independent) random variables. In this case:
f(x1, ..., xn) = ∏_{i=1,...,n} fXi(xi)
• Again, this must be true for all possible (x1, ..., xn). (Notice the subscript added to the one dimensional densities, to distinguish among the variables, and the lower case xi, which are possible values of the variables.)

19.43 Conditional Expectation
• Given a conditional probability function P(x1, ..., xi | xi+1, ..., xn) or a conditional density f(x1, ..., xi | xi+1, ..., xn) we can define conditional expected values of, in general, vector valued functions of the conditioned random variables.
• Something like E(g(x1, ..., xi) | xi+1, ..., xn) (the expected value is defined exactly as in the one dimensional case by a proper sum/series or integral operator).

19.44 Conditional Expectation
• It is to be understood that such an expected value is a function of the conditioning variables. If we understand this it should not be a surprise that we can take the expected value of a conditional expected value. In this case the following property is of paramount relevance:
E(E(g(x1, ..., xi) | xi+1, ..., xn)) = E(g(x1, ..., xi))
• where, in order to understand the formula, we must remember that the outer expected value on the left hand side of the identity is with respect to (wrt) the marginal distribution of the vector of conditioning variables (xi+1, ..., xn), while the inner expected value of the same side of the identity is wrt the conditional distribution.
Notice that, in general, this inner expected value E(g(x1, ..., xi) | xi+1, ..., xn) is a function of the conditioning variables (the conditioned variables are "integrated out" in the operation of taking the conditional expectation) so that it is meaningful to take its expected value with respect to the conditioning variables.
• The expected value on the right hand side, however, is with respect to the marginal distribution of the conditioned variables (x1, ..., xi).

19.45 Conditional Expectation
• To be really precise we must say that the notation we use (small printed letters for both the values and the names of the random variables) is approximate: we should use capital letters for variables and small letters for values. However we follow the practice that usually leaves the distinction to the discerning reader.

19.46 Law of Iterated Expectations
• The above property is called the "law of iterated expectations" and can be written in much more general ways.
• In the simplest case of two vectors we have: EY(EX|Y(X|Y)) = EX(X). For the conditional expectation, wrt the conditioned vector, all the properties of the marginal expectation hold.

19.47 Regressive Dependence
• Regression function and regressive dependence.
• Being a function of Y, the conditional expectation EX|Y(X|Y) is also called the "regression function" of X on Y. Analogously, EY|X(Y|X) is the regression function of Y on X. If EX|Y(X|Y) is constant wrt Y we say that X is regressively independent of Y.
• If EY|X(Y|X) is constant wrt X we say that Y is regressively independent of X.
• Regressive dependence/independence is not a symmetric concept: it can hold on one side only.
• Moreover, stochastic independence implies two sided regressive independence; again, the converse is not true.
• A tricky topic: conditional expectation is, in general, a "static" concept. For any GIVEN set of values of, say, Y you compute EX|Y(X|Y).
However, implicitly, the term "regression function" implies the possibility of varying the values of the conditioning vector (or variable). This must be taken with the utmost care as it is at the origin of many misunderstandings, in particular with regard to "causal interpretations" of conditional expectations. The best, if approximate, idea to start with is that EX|Y(X|Y) gives us a "catalog" of expected values, each valid under given "conditions" Y, be it or not be it possible or meaningful to "pass" from one set of values of Y to another set.

19.48 Covariance and Correlation
• The covariance between two random variables X and Y is defined as: Cov(X, Y) = E(XY) − E(X)E(Y).
• From the above definition we get that, for any set of constants a, b, c, d: Cov(a + bX, c + dY) = bd Cov(X, Y).
• An important result (the Cauchy inequality) allows us to show that |Cov(X, Y)| ≤ √(V(X)V(Y)). From this we derive a "standardized covariance" called the "correlation coefficient": Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)).
• We have Cor(a + bX, c + dY) = Sign(bd) Cor(X, Y).
• The square of the correlation coefficient is usually called R square or rho square.
• Notice that regressive independence, even only unilateral, implies zero covariance and zero correlation; the converse, however, is in general not true.

19.49 Distribution of the max and the min for independent random variables
• Let {X1, ..., Xn} be independent random variables with distribution functions FXi(.).
• Let X(1) = max{X1, ..., Xn} and X(n) = min{X1, ..., Xn}.
• Then FX(1)(t) = ∏_{i=1..n} FXi(t) and FX(n)(t) = 1 − ∏_{i=1..n} (1 − FXi(t)).
• If the random variables are also identically distributed we have
FX(1)(t) = ∏_{i=1..n} FXi(t) = Fⁿ(t)
and
FX(n)(t) = 1 − ∏_{i=1..n} (1 − FXi(t)) = 1 − (1 − F(t))ⁿ

19.50 Distribution of the max and the min for independent random variables
• Why? Consider the case of the max.
FX(1)(t) is, by definition, the probability that the value of the max among the n random variables is less than or equal to t.
• But the max is less than or equal to t if and only if each random variable is less than or equal to t.
• Since they are independent this is given by the product of the FXi, each computed at the same point t, that is FX(1)(t) = ∏_{i=1..n} FXi(t).
• For the min: 1 − FX(n)(t) is the probability that the min is greater than t. But this is true if and only if each of the n random variables has a value greater than t, and for each random variable this probability is 1 − FXi(t). They are independent, so...

19.51 Distribution of the sum of independent random variables and central limit theorem
• Let {X1, ..., Xn} be independent random variables. Let Sn = Σ_{i=1..n} Xi be their sum.
• We know that, if each random variable has expected value µi and variance σi², then E(Sn) = Σ_{i=1..n} µi and V(Sn) = Σ_{i=1..n} σi².
• To be more precise: the first property is always valid, whatever the dependence, provided the expected values exist, while the second only requires zero correlation (provided the variances exist).
• Can we say something about the distribution of Sn?
• If we knew the distributions of the Xi we could (but this could be quite cumbersome) compute the distribution of the sum.
• However, even if we do not know (better: do not make hypotheses on) the distributions of the Xi, we can still prove a powerful and famous result which, in its simplest form, states:

19.52 Distribution of the sum of independent random variables and central limit theorem
• Let {X1, ..., Xn} be iid random variables with expected value µ and variance σ². Then
lim_{n→∞} Pr( (Sn/n − µ)/(σ/√n) ≤ t ) = Φ(t)
where, as specified above, Φ(.) is the distribution function of a standard Gaussian.
• In practice this means that, under the hypotheses of this theorem, if "n is big enough" (a sentence whose meaning should be, and can be, made precise) we can approximate FSn(s) with Φ((s/n − µ)/(σ/√n)).

19.53 Distribution of the sum of independent random variables and central limit theorem
• More general versions of this theorem, with non necessarily identically distributed or even non independent Xi, exist.
• This result is fundamental in statistical applications where confidence levels for confidence intervals or sizes of errors for tests must be computed in non standard settings.

Statistical inference

19.54 Why Statistics
• Probabilities are useful when we can specify their values. As we did see above, sometimes, in finite settings (coin flipping, dice rolling, card games, roulette, etc.), it is possible to reduce all probability statements to simple statements judged, by symmetry properties, equiprobable.
• In these cases we say we "know" probabilities (at least in the sense that we agree on their values and, as a first approximation, do not look for some "discovery rule" for probabilities) and use these for making decisions (meaning: betting). In other circumstances we are not so lucky.
• This is obvious when we consider betting on horse racing, computing insurance premia, investing in financial securities. In all these fields "symmetry" statements are not reasonable.
• However, from the didactic point of view, it is useful to show that the "problem" is there even with simple physical "randomizing devices" when their "shape" does not allow for simple symmetry statements.
• Consider for instance rolling a pyramidal "die": this is a five sided object with four triangular sides and one square side. In this case what is the probability for each single side to be the down side? For some news on dice see http://en.wikipedia.org/wiki/Dice

19.55 Unknown Probabilities and Symmetry
• The sides are not identical, so the classical argument for equiprobability does not hold. We may agree that the probability of each triangular face is the same, as the die is clearly symmetric if seen with the square side down.
But then: what is the total value of these four probabilities? Or, which is the same, what is the probability for the square face to be the down one?
• Just by observing different pyramidal dice we could surmise that the relative probabilities of the square face and of the four triangular faces depend, also, on the effective shape of the triangular faces. We could hypothesize, perhaps, that the greater the height of such faces, the bigger the probability for a triangular face to be the down one in comparison with the probability for the square face.

19.56 Unknown Probabilities and Symmetry
• With skillful physical arguments we could come up with some quantitative hypotheses; we understand, however, that this shall not be simple. In all likelihood a direct observation of the results from a series of actual rolls of this die could be very useful.
• For instance we could observe, not simply hypothesize, that (for a pyramid made of some homogeneous substance) the more peaked the triangular sides (and so the bigger their area for a given square base of the pyramid) the smaller the probability for the square side to be the one down after the throw. We could also observe, directly or by mind experiment, that the "degenerate" pyramid having height equal to zero is, essentially, a square coin, so that the probability of each side (the square one and the one which shall transform into the four triangles) should be 1/2. From these two observations and some continuity argument we could conclude that there should be some unknown height such that the probability of falling on a triangular side is, say, 1 > c ≥ 1/2 and, by symmetry, the probability for each triangular side is c/4. This provided there is no cheating in throwing the die, so that the throw is "chaotic" enough. So, beware of magicians!
• What is interesting is that this "mental+empirical" analysis gives us a possible probability model for the result of throwing our pyramidal die.
Moreover, this model could be enriched by some law connecting c with the height of the pyramid. Is c proportional to the height? Proportional to the square of the height? To the square root of the height? As we shall see in what follows, Statistics could be a tool for choosing among these hypotheses.
• Conversely, suppose you know, from previous analysis, that, for a pyramid made of homogeneous material, a good approximation is c proportional to the height. In this case a good test to assess the homogeneity of the material of which the pyramid is made could be that of throwing several pyramids of different heights and seeing whether the ratio between the frequency of a triangular face and the height of the pyramid is constant.

19.57 No Symmetry
• Consider now a different example: horse racing. Here the event whose probability we are interested in is, to keep things simple, the name of the winner.
• It is "clear" that symmetry arguments here are useless. Moreover, in this case even the use of past data cannot mimic the case of the pyramid: while observation of past race results could be relevant, the idea of repeating the same race a number of times in order to derive some numerical evaluation of probability is both impractical and, perhaps, even irrelevant.

19.58 No Symmetry
• What we may deem useful are data on past races of the contenders, but these data regard different track conditions, different tracks and different opponents.
• Moreover they regard different times, hence a different age of the horse(s), a different period of the year, a different level of training, and so on.
• History, in short.
• This notwithstanding, people bet, and bet hard, on such events since immemorial past. Where do their probabilities come from?
• An interesting point to be made is that, in antiquity, while betting was even more common than it is today (in many cultures it had a religious content: looking for the favor of the gods), betting tools like dice existed in a very rudimentary form with respect to today. We know examples of fantastically "good" dice made of glass or amber (many of these being not used for actual gambling but as offerings to the Deity). These are very rare. The most commonly used die came from a roughly cubic bone of a goat or a sheep. In this case symmetry arguments were impossible and experience could be useful.
• An interesting anthropological fact is that in classical times gambling was very common, and the concepts of chance and luck were so widespread as to merit specific deities. However no hint of any kind of "uncertainty quantification" is known, with the exception of some side comments. Why this is the case is a mystery. It may be that the religious content mentioned above made the idea of quantifying chance in some sense blasphemous, but this is only a hypothesis.

19.59 Learning Probabilities
• Let us sum up: probability is useful for taking decisions (betting) when the only unknown is the result of the game.
• This is the typical case in simple games of chance (not in the, albeit still simple, pyramidal die case).
• If we want to use probability when numerical values for probabilities are not easily derived, we are going to be uncertain both about the results and about the probabilities of such results.
• We can do nothing (legal) about the results of the game, but we may do something to build some reasonable way of assessing probabilities. In a nutshell, this is the purpose of Statistics.
• The basic idea of Statistics is that, in some cases, we can "learn" probabilities from repeated observations of the phenomena we are interested in.
• The problem is that for "learning" probabilities we need ... probabilities!
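The idea of "learning" a probability from repeated observation can be illustrated by simulating the pyramidal die; the "true" face probabilities below are, of course, an arbitrary assumption made only for the sketch:

```python
import random

random.seed(1)

# assumed model (unknown to the "statistician"): each triangular face
# has probability theta, the square face has probability 1 - 4*theta
theta = 0.15
faces = [1, 2, 3, 4, 5]
probs = [theta] * 4 + [1 - 4 * theta]

rolls = random.choices(faces, weights=probs, k=100_000)

# the relative frequency of a triangular face approximates theta
freq_face_1 = rolls.count(1) / len(rolls)
print(freq_face_1)   # close to 0.15 for a large number of rolls
```

The more rolls we observe, the closer the relative frequency tends to be to the underlying probability: this is the intuition that the following sections make precise.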
19.60 Pyramidal Die
• Let us work at an intuitive level on a specific problem. Consider this set of basic assumptions concerning the pyramidal die problem.
• We may agree that the probability for each face to be the down one in repeated rollings of the die is constant: unknown, but constant.
• Moreover, we may accept that the order in which results are recorded is, for us, irrelevant, as the "experiments" (rolls of the die) are made always in the same conditions.
• We, perhaps, shall also agree that the probability of each triangular face is the same.

19.61 Pyramidal Die Model
• Well: we now have a "statistical model". Let us call θi, i = 1, 2, 3, 4 the probabilities of each triangular face.
• These are going to be non negative numbers (Probability Theory requires this); moreover, if we agree with the statement about their identity, each of these values must be equal to the same θ, so the total probability for a triangular face to be the down one shall be 4θ.
• By the rules of probability, the probability for the square face is going to be 1 − 4θ and, since this cannot be negative, we need θ ≤ .25 (where we perhaps shall avoid the equality part of the ≤ sign).
• If we recall the previous analysis we should also require θ ≥ 1/8.

19.62 Pyramidal Die Constraints
• All these statements come from Probability Theory joined with our assumptions on the phenomenon we are observing.
• In other, more formal, words we specify a probability model for each roll of the die and state this:
• In each roll we can have a result in the range 1, 2, 3, 4, 5;
• The probability of each of the first four values is θ and this must be a number not greater than .25.
• With just these words we have hypothesized that the probability distribution of each result in a single toss is an element of a simple but infinite and very specific set of probability distributions, completely characterized by the numerical value of the "parameter" θ, which could be any number in the "parameter space" given by the real numbers between 1/8 and 1/4 (left extreme included if you like).

19.63 Many Rolls
• This is a model for a single rolling. But, exploiting our hypotheses, we can easily go on to a model for any set of rollings of the die.
• In fact, we supposed, as stated, that each sequence of results of given length has a probability which only depends on the number of triangular and square faces observed in the series (in technical terms we say that the observation process produces an "exchangeable" sequence of results, that is: sequences of results containing the same number of 5s and non 5s have the same probability).
• Just for simplicity in computation let us move on a step: we shall strengthen our hypothesis and actually state that the results of different rollings are stochastically independent (this is a particular case of exchangeability, that is: it implies but is not implied by exchangeability).

19.64 Probability of Observing a Sample
• Under this hypothesis and the previously stated probability model for each single roll, the joint probability of a sample of size n, where we only record 5s and non 5s, is just the product of the probabilities for each observation.
• In our example: suppose we roll the die 100 times and observe 40 times a 5 (square face down) and 60 times either 1 or 2 or 3 or 4; since each of these faces is incompatible with the others and each has probability θ, the probability of "either 1 or 2 or 3 or 4" is 4θ.
• The joint probability of the observed sample is thus (4θ)^60 (1 − 4θ)^40.

19.65 Pre or Post Observation?
But here there is a catch, and we must understand this well: are we computing the probability of a possible sample before observation, or the probability of the observed sample? In the first case no problem, the answer is correct; but, in the second, we must realize that the probability of observing the observed sample is actually one: after all we DID observe it!
• Let us forget, for the moment, this subtlety, which is going to be relevant in what follows. We have the probability of the observed sample; since the sample is given, the only thing in the formula which can change value is the parameter θ.
• The probability of observing the given sample shall, in general, be a function of this parameter.

19.66 Maximize the Probability of the Observed Sample
• The value which maximizes the probability of the observed sample among the possible values of θ is (check it!) θ̂ = 60/400 = 3/20 = .15.
• Notice that this value maximizes (4θ)^60 (1 − 4θ)^40, the probability of observing the given sample (or any specific sample containing 40 5s and 60 non 5s), but it also maximizes C(100, 40) (4θ)^60 (1 − 4θ)^40, that is: the probability of observing A sample in the set of samples containing 40 5s and 60 non 5s. (Be careful in understanding the difference between "the given sample" and "A sample in the set"; moreover notice that C(100, 40) = C(100, 60).)

19.67 Maximum Likelihood
• Stop for a moment and fix some points. What did we do, after all? Our problem was to find a probability for each face of the pyramidal die. The only thing we could say a priori was that the probability of each triangular face was the same. From this and simple probability rules we derived a probability model for the random variable X whose values are 1, 2, 3, 4 when the down face is triangular, and 5 when it is square.
• We then added an assumption on the sampling process: observations are iid (independent and identically distributed as X).
The two assumptions constitute a "statistical model" for X and are enough for deriving a strategy for "estimating" θ (the probability of any given triangular face).
• The suggested estimate is the value θ̂ which maximizes the joint probability of observing the sample actually observed. In other words we estimated the unknown parameter according to the maximum likelihood method.

19.68 Sampling Variability
• At this point we have an estimate of θ and the first important point is to understand that this actually is just an estimate: it is not to be taken as the "true" value of θ.
• In fact, if we roll the die another 100 times and compute the estimate with the same procedure, most likely a different estimate shall come up; and for another sample, another one, and so on and on.
• Statisticians do not only find estimates; most importantly they study the worst enemy of someone who must decide under uncertainty and unknown probabilities: sampling variability.

19.69 Possibly Different Samples
• The point is simple: consider all possible different samples of size 100. Since, as we assumed before, the specific value of a non 5 is irrelevant, let us suppose, for simplicity, that all that is recorded in a sample is a sequence of 5s and non 5s.
• Since in each roll we either get a 5 or a non 5, the total number of these possible samples is 2^100.
• On each of these samples our estimate could take a different value; consider, however, that the value of the estimate only depends on how many 5s and non 5s were observed in the specific sample (the estimate is the number of non 5s divided by 4 times 100).
• So the probability of observing a given value of the estimate is the same as the probability of the set of samples with the corresponding number of 5s.
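The maximization can be checked numerically; a sketch that scans the parameter space (1/8, 1/4) for the likelihood (4θ)^60 (1 − 4θ)^40 of the sample described above:

```python
# likelihood of a sample with 60 non-5s and 40 5s, as a function of theta
def likelihood(theta, non_fives=60, fives=40):
    return (4 * theta) ** non_fives * (1 - 4 * theta) ** fives

# crude grid search over the admissible parameter space (1/8, 1/4)
grid = [1/8 + i * (1/8) / 10_000 for i in range(1, 10_000)]
theta_hat = max(grid, key=likelihood)
print(round(theta_hat, 4))   # close to the analytical maximizer 60/400 = 0.15
```

A grid search is, of course, only a brute-force check: the same maximizer comes out of the first order condition, as the "check it!" above invites you to verify.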
19.70 The Probability of Our Sample
• But it is easy to compute this probability: since, by our assumptions on the statistical model, every sample containing the same number of 5s (and so of non 5s) has the same probability, in order to find this probability we can simply compute the probability of a generic sample of this kind and multiply it by the number of possible samples with the same number of 5s.
• If the number of 5s is, say, k, we find that the probability of the generic sample with k 5s and 100 − k non 5s is (see above): (4θ)^(100−k) (1 − 4θ)^k.

19.71 The Probability of a Similar Estimate
• This is the same for any sample with k 5s and 100 − k non 5s. There are many samples of this kind, depending on the order of results. The number of possible samples of this kind can be computed in this simple way: we must put k 5s in a sequence of 100 possible places.
• We can insert the first 5 in any of 100 places, the second in any of 99, and so on.
• We get 100 · 99 · ... · (100 − k + 1) = 100!/(100 − k)!; however there are k ways to choose the first 5, k − 1 for the second, and so on up to 1 for the k-th, and for all these k! orderings the sample is always the same, so the number of different samples (the number of "combinations") is 100!/(k!(100 − k)!) = C(100, k).

19.72 The Probability of a Similar Estimate
• This is the number of different "strings" of 100 elements each containing k 5s and 100 − k non 5s.
• Summing up: the probability of observing k 5s in 100 rolls, hence of computing an estimate of θ equal to (100 − k)/400, is precisely C(100, k) (4θ)^(100−k) (1 − 4θ)^k (which is a trivial modification of the binomial).

19.73 The Probability of a Similar Estimate
• So, before sampling, for any possible "true" value of θ we have a different probability for each of the (101 in this case) possible values of the estimate.
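The sampling distribution of the estimate can be tabulated directly from the binomial-type formula above; a sketch where θ = 0.15 is just an assumed "true" value:

```python
from math import comb

n, theta = 100, 0.15

# P(estimate = (n - k)/(4n)) = C(n, k) (4θ)^(n-k) (1 - 4θ)^k, with k = number of 5s
pmf = {(n - k) / (4 * n): comb(n, k) * (4 * theta) ** (n - k) * (1 - 4 * theta) ** k
       for k in range(n + 1)}

print(sum(pmf.values()))       # the probabilities sum to 1
print(max(pmf, key=pmf.get))   # the most probable value of the estimate
```

With this θ the most probable value of the estimate turns out to be θ itself, in line with the remark below about the mode of the sampling distribution.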
• The reader shall realize that, for each given value of θ, the a priori (before sampling) most probable value of the estimate is the one corresponding to the integer number of 5s nearest to 100(1 − 4θ) (which in general shall not be an integer).

19.74 The Estimate in Other Possible Samples
• Obviously, since this is just the most probable value of the estimate if the probability is computed with this θ, it is quite possible, in fact very likely, that a different sample is observed.
• Since our procedure is to estimate θ with (100 − k)/400, this immediately implies that, in case the observed sample is not the most probable for that given θ, the value of the estimate shall NOT be equal to θ; in other words it shall be "wrong", and the reason for this is the possibility of observing many different samples for each given "true" θ, that is: sampling variability.
• In general, using the results above, for any given θ, the probability of observing a sample of size n which gives as an estimate (n − k)/(4n) is (as above) C(n, k) (4θ)^(n−k) (1 − 4θ)^k.

19.75 The Estimate in Other Possible Samples
• So, for instance, the probability, given this value of θ, of observing a sample such that the estimate (n − k)/(4n) is equal to the parameter value is, if we suppose that the value of θ which we use in computing this probability can be written as (n − k)/(4n) (otherwise the probability is 0 and we must use intervals of values):
C(n, k) (4(n − k)/(4n))^(n−k) (1 − 4(n − k)/(4n))^k = C(n, k) ((n − k)/n)^(n−k) (k/n)^k
• Due to what we did see above, the value (n − k)/(4n) is the most probable value of the estimate when θ = (n − k)/(4n), but many other values may have sizable probability, so that, even if the "true value" is θ = (n − k)/(4n), it is possible to observe estimates different from (n − k)/(4n) with non negligible probability.

19.76 Sampling Variability
• The study of the distribution of the estimate given θ is called the study of the "sampling variability" of the estimate: the attitude of the estimate to change in different samples. It can be done in several different ways.
• For instance, using again our example, we see clearly that there does not exist a single "sampling distribution" of the estimate, as there is one for each value of the parameter.
• On one hand this is good, because otherwise the estimate would give us quite poor information about θ: the information we get from the estimate comes exactly from the fact that for different values of θ different values of the estimate are more likely to be observed.
• On the other hand it does not allow us to say which is "the" sampling distribution of the estimate, but only gives us a family of such distributions.

19.77 Sampling Variability
• However, even if we do not know the value of the parameter, we may study several aspects of the sampling distribution.
• For instance, for each θ we can compute the expected value of the estimate under the sampling distribution corresponding to that particular value of θ. In other words we could compute
Σ_{k=0..n} ((n − k)/(4n)) C(n, k) (4θ)^(n−k) (1 − 4θ)^k
and by doing this computation we would see that the result is θ itself, no matter which value θ has. So we say that the estimate is unbiased.

19.78 Sampling Variability
• Again, for each θ we can compute the variance of the estimate under the sampling distribution corresponding to that particular value of θ. That is, we could compute
Σ_{k=0..n} ((n − k)/(4n))² C(n, k) (4θ)^(n−k) (1 − 4θ)^k − θ² = 4θ(1 − 4θ)/(16n)
the "sampling variance" of the estimate, and see that, while this is a function of θ (whose value is unknown to us), for any value of θ it goes to 0 when n goes to infinity.
This, joint with the above unbiasedness result, implies (Tchebicev inequality) that the probability of having
$$\frac{n-k}{4n}\in[\theta\pm c]$$
that is: of observing a value of the estimate differing from θ by at most c, goes to 1 for ANY c > 0, no matter the value of θ. This is called “mean square consistency”.

19.79 Sampling Variability
• A curiosity. In typical applications the sampling variance depends on the unknown parameter(s).
• While any reasonable estimate must have a sampling distribution depending on the unknown parameter(s), there are cases where the sampling variance could be independent of the unknown parameter(s).
• For instance, in iid sampling from an unknown distribution with unknown expected value µ and known standard deviation σ, the usual estimate of µ, the arithmetic mean of the data, has a sampling variance equal to $\sigma^2/n$, which does not depend on unknown parameters (repeat: we assumed σ known).

19.80 Estimated Sampling Variability
• In the end, if we wish for some “number” for the sampling variance when, as in our case, it depends on the unknown parameter, and not just the formula $\frac{4\theta(1-4\theta)}{16n}$, or for some specific distribution in the place of the family of distributions $\binom{n}{k}(4\theta)^{n-k}(1-4\theta)^{k}$, we could “estimate” these by substituting in the formulas the estimate $\hat\theta=\frac{n-k}{4n}$ for the unknown value θ and get
• $\hat V(\hat\theta)=\frac{4\hat\theta(1-4\hat\theta)}{16n}$ and $\hat P\left(\hat\theta=\frac{n-k}{4n}\right)=\binom{n}{k}(4\hat\theta)^{n-k}(1-4\hat\theta)^{k}$, and always remember to notice the “hats” on V and P.

19.81 Quantifying Sampling Variability
• Whatever method we use for dealing with sampling variability, the point is to face it.
• We could find different procedures for computing our estimate; however, for the same reason (for each given true value of θ many different samples are possible), any reasonable estimate always has a sampling distribution (in reasonable cases depending on θ), so we would in any case face the same problem: sampling variability.
• The point is not to avoid sampling variability but to live with it.
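The “hatted” plug-in quantities of section 19.80 can be computed directly; a sketch, where the observed counts are illustrative numbers:

```python
from math import comb

n, k = 100, 37                    # illustrative observed sample: 37 fives in 100 draws
theta_hat = (n - k) / (4 * n)     # the point estimate

# plug-in ("hatted") versions: the unknown theta is replaced by theta_hat
var_hat = 4 * theta_hat * (1 - 4 * theta_hat) / (16 * n)
p_hat = comb(n, k) * (4 * theta_hat) ** (n - k) * (1 - 4 * theta_hat) ** k

print(theta_hat, var_hat, p_hat)
```

Note that `var_hat` and `p_hat` are themselves estimates, hence affected by sampling error, which is exactly why the “hats” matter.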
In order to do this it is better to follow some simple principles.
• Simple, yes, but so often forgotten, even by professionals, as to create most of the problems encountered in practical applications of Statistics.

19.82 Principle 1
• The first obvious principle to follow in order to be able to do this is: “do not forget it”.
• An estimate is an estimate is an estimate; it is not the “true” θ.
• This seems obvious, but errors of this kind are quite common: it seems the human brain does not like uncertainty and, if not properly conditioned, it shall try, in any possible way, to wrongly convince us that we are sure about something of which we only possess some clue.

19.83 Principle 2
• The second principle is “measure it”.
• An estimate (point estimate) by itself is almost completely useless; it should always be supplemented with information about sampling variability.
• At the very least, information about the sampling standard deviation should be added. Reporting in the form of confidence intervals could be quite useful.
• This, and not point estimation, is the most important contribution Statistics may give to your decisions under uncertainty.

19.84 Principle 3
• The third principle is “do not be upset by it”.
• Results of decisions may upset you even under certainty. This is obviously much more likely when chance is present, even if probabilities are known.
• We are at the third level: no certainty, chance is present, probabilities are unknown!
• The best Statistics can do is guarantee an efficient and logically coherent use of available information.
• It does not guarantee Luck in “getting the right estimates” and, obviously, it cannot guarantee that, even if probabilities are estimated well, something very unlikely does not happen! (And no matter what, people shall always expect, forgive the joke, that what is most probable is much more likely than it is probable).
19.85 The Questions of Statistics
• This long discussion should be useful as an introduction to the statistical problem:
• why do we need to do inference, instead of simply using Probability?
• what can we expect from inference?
• Now let us be a little more precise.

19.86 Statistical Model
• This is made of two ingredients.
• The first is a probability model for a random variable (or, more generally, a random vector, but here we shall consider only the one dimensional case).
• This is simply a set of distributions (probability functions or densities) for the random variable of interest. The set can be indexed by a finite set of numbers (parameters), and in this case we speak of a parametric model. Otherwise we speak of a non parametric model.
• The second ingredient is a sampling model, that is: a probabilistic assessment about the joint distribution of repeated observations on the variable of interest.
• The simplest example of this is the case of independent and identically distributed observations (simple random sampling).

19.87 Specification of a Parametric Model
• Typically a parametric model is specified by choosing some functional form for the probability or density function (here we use the symbol P for both) of the random variable X, say $X\sim P(X;\theta)$, and a set of possible values for θ: θ ∈ Θ (in the case of a parametric model).
• Sometimes we do not fully specify P but simply ask, for instance, for X to have a certain expected value or a certain variance.

19.88 Statistic
• A fundamental concept is that of “estimate” or “statistic”. Given a sample X, an estimate is simply a function of the sample and nothing else: T(X).
• In other words, it cannot depend on unknowns such as the parameters in the model. Once the sample is observed the estimate becomes a number.

19.89 Parametric Inference
• When we have a parametric model we typically speak about “parametric inference”, and we are going to do so here.
• This may give the false impression that statisticians are interested in parameter values.
• Sometimes this may be so but, really, statisticians are interested in assessing probabilities for (future) values of X; parameters are just “middlemen” in this endeavor.

19.90 Different Inferential Tools
• Traditionally, parametric inference is divided into three (interconnected) sections:
• Point estimation;
• Interval estimation;
• Hypothesis testing.

19.91 Point Estimation
• In point estimation we try to find an estimate T(X) for the unknown parameter θ (the case of a multidimensional parameter is completely analogous).
• In principle, any statistic could be an estimate, so we discriminate between good and bad estimates by studying the sampling properties of these estimates.
• In other words, we try to assess whether a given estimate's sampling distribution (that is, as we did see before, the probability distribution of the possible values of the statistic as induced by the probabilities of the different possible samples) enjoys or not a set of properties we believe useful for a good estimate.

19.92 Unbiasedness
• An estimate T(X) is unbiased for θ iff $E_\theta(T(X))=\theta,\ \forall\theta\in\Theta$. In order to understand the definition (and the concept of sampling distribution) it is important to realize that, in general, the statistic T has a potentially different expected value for each different value of θ (hence for each different distribution of the sample).
• What the definition asks for is that this expected value always corresponds to the θ which indexes the distribution used for computing the expected value itself.

19.93 Mean Square Error
• We define the mean square error of an estimate T as $MSE_\theta(T)=E_\theta((T-\theta)^2)$.
• Notice how, in this definition, we stress the point that the MSE is a function of θ (just like the expected value of T).
• We recall the simple result:
$$E_\theta((T-\theta)^2)=E_\theta((T-E_\theta(T)+E_\theta(T)-\theta)^2)=E_\theta((T-E_\theta(T))^2)+(E_\theta(T)-\theta)^2$$
where the first term in the sum is the sampling variance of the estimate and the second is the squared “bias”.
• Obviously, for an unbiased estimate, MSE and sampling variance are the same.

19.94 Mean Square Efficiency
• Suppose we are comparing two estimates for θ, say T1 and T2.
• We state that T1 is not less efficient than T2 if and only if $MSE_\theta(T_1)\le MSE_\theta(T_2)\ \forall\theta\in\Theta$.
• As in the case of unbiasedness, the most important point is to notice the “for all” quantifier (∀).
• This implies, for instance, that we cannot be sure, given two estimates, that one is not worse than the other under this definition.
• In fact it may well happen that the mean square errors, as functions of the parameter, “cross”, so that one estimate is “better” for some set of parameter values while the other is “better” for a different set.
• In other words, the order induced on estimates by this definition is only partial.

19.95 Meaning of Efficiency
If an estimate T1 satisfies this definition wrt another estimate T2, this means (use Tchebicev inequality and the above decomposition of the mean square error) that T1 shall have a bigger (better: not smaller) probability than T2 of being “near” θ, for any value of this parameter.

19.96 Mean Square Consistency
• Here we introduce a variation. Up to now the properties considered only fixed sample sizes. Here, on the contrary, we consider the sample size n as a variable.
• Obviously, since an estimate is defined on a given sample, this new setting requires the definition of a sequence of estimates, and the property we are about to state is not a property of an estimate but of a sequence of estimates.

19.97 Mean Square Consistency
• A sequence {Tn} of estimates is termed “mean square consistent” if and only if
$$\lim_{n\to\infty}MSE_\theta(T_n)=0,\ \forall\theta\in\Theta.$$
• You should notice again the quantifier on the values of the parameter.
• Given the above decomposition of the MSE, the property is equivalent to the joint requirement:
$$\lim_{n\to\infty}E_\theta(T_n)=\theta,\ \forall\theta\in\Theta \quad\text{and}\quad \lim_{n\to\infty}V_\theta(T_n)=0,\ \forall\theta\in\Theta.$$
• Again, using Tchebicev, we understand that the requirement implies that, for any given value of the parameter, the probability of observing a value of the estimate in any given interval containing θ goes to 1 if the size of the sample goes to infinity.

19.98 Methods for Building Estimates
We could proceed by trial and error, but this would be quite time consuming. It is better to devise some “machinery” for creating estimates which we can reasonably expect to be “good” in at least some of the above defined senses.

19.99 Method of Moments
• Suppose we have an iid (to be simple) sample X from a random variable X distributed according to some (probability or density) P(X; θ), θ ∈ Θ, where the parameter is, in general, a vector of k components.
• Suppose, moreover, that X has got, say, n moments $E(X^m)$ with m = 1, ..., n.
• In general we shall have $E(X^m)=g_m(\theta)$, that is: the moments are functions of the unknown parameters.

19.100 Estimation of Moments
• Now, under iid sampling, it is very easy to estimate moments in a way that is, at least, unbiased and mean square consistent (and also, under proper hypotheses, efficient).
• In fact the estimate $\widehat{E(X^m)}=\sum_{i=1,...,n}X_i^m/n$, that is: the m-th empirical moment, is immediately seen to be unbiased, while its MSE (the variance, in this case) is $\frac{V(X^m)}{n}$, which (if it exists) obviously goes to 0 if the size n of the sample goes to infinity.

19.101 Inverting the Moment Equation
• The idea of the method of moments is simple. Suppose for the moment that θ is one dimensional.
• Choose any gm and suppose it is invertible (if the model is sensible, this should be true. Why?).
• Estimate the corresponding moment of order m with the empirical moment of the same order and take as an estimate of θ the function $\hat\theta_m=g_m^{-1}\left(\sum_{i=1,...,n}X_i^m/n\right)$.
• In the case of k parameters, just solve with respect to the unknown parameter vector a system of k equations connecting the parameter vector with k moments, estimated with the corresponding empirical moments.

19.102 Problems
• This procedure is intuitively alluring. However we have at least two problems. The first is that any different choice of moments is going to give us, in general, a different estimate (consider for instance the negative exponential model and estimate its parameter using different moments).
• The Generalized Method of Moments tries to solve this problem (do not worry! this is something you may ignore, for the moment).
• The second is that, while empirical moments under iid sampling are, for instance, unbiased estimates of the corresponding theoretical moments, this is usually not true for method of moments estimates. This is due to the fact that the gm we use are typically not linear.
• Under suitable hypotheses we can show that method of moments estimates are mean square consistent, but this is usually all we can say.

19.103 Maximum Likelihood
• Maximum likelihood method (one of the many inventions of Sir R. A. Fisher: the creator of modern mathematical Statistics and modern mathematical genetics).
• Here the idea is clear if we are in a discrete setting (i.e. if we consider a model of a probability function).
• The first step in the maximum likelihood method is to build the joint distribution of the sample.
• In the context described above (iid sample) we have $P(\mathbf{X};\theta)=\prod_i P(X_i;\theta)$.
• Now, observe the sample and change the random variables in these formulas (Xi) into the corresponding observations (xi).
• The resulting P(x; θ) cannot be seen as a probability of the sample (the probability of the observed sample is, obviously, 1), but can be seen as a function of θ given the observed sample: $L_x(\theta)=P(x;\theta)$.
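In the discrete setting just described, the likelihood can be maximized numerically; a sketch for the binomial “fives” model of the earlier example (the observed counts and the grid search are purely illustrative, since here the maximizer is available analytically):

```python
from math import comb

# likelihood of the "fives" model: k fives in n draws, P(5) = 1 - 4*theta
n, k = 100, 37   # illustrative observed sample

def likelihood(theta):
    return comb(n, k) * (4 * theta) ** (n - k) * (1 - 4 * theta) ** k

grid = [i / 10000 for i in range(1, 2500)]  # theta in the open interval (0, 0.25)
theta_ml = max(grid, key=likelihood)
print(theta_ml)   # the grid point at (n - k)/(4n) = 0.1575
```

The grid maximum coincides with the estimate $(n-k)/(4n)$ used at the beginning of the section: maximum likelihood recovers the intuitive procedure.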
• It is by no means a probability, either of the sample or of θ, hence the new name. • The maximum likelihood method suggests the choice, as an estimate of θ, of the value that maximizes the likelihood function given the observed sample,formally: θbml = arg maxLx (θ). θ∈Θ 19.105 Interpretation • If P is a probability (discrete case) the idea of the maximum likelihood method is that of finding the value of the parameter which maximizes the probability of observing the actually a posteriori observed sample. • The reasoning is exactly as in the example at the beginning of this section. • While for each given value of the parameter we may observe, in general, many different samples, a set of these (not necessarily just one single sample: many different samples may have the same probability) has the maximum probability of being observed given the value of the parameter. 19.106 Interpretation • We observe the sample and do not know the parameter value so, as an estimate, we choose that value for which the specific sample we observe is among the most probable samples. • Obviously, if , given the parameter value, the sample we observe is not among the most probable, we are going to make a mistake, but we hope this is not the most common case and we can show, under proper hypotheses, that the probability of such a case goes to zero if the sample size increases to infinity. 262 19.107 Interpretation • A more satisfactory interpretation of maximum likelihood in a particular case. • Suppose the parameter θ has a finite set (say m) of possible values and suppose that, a priori of knowing the sample, the statistician considers the probability of each of this values to be the same (that is 1/m). • Using Bayes theorem, the posterior probability of a given value of the parameter 1 P (x|θj ) m = h(x)Lx (θj ). 
given the observed sample shall be:
$$P(\theta_j|x)=\frac{P(x|\theta_j)\frac{1}{m}}{\sum_j P(x|\theta_j)\frac{1}{m}}=h(x)L_x(\theta_j).$$

19.108 Interpretation
• In words: if we consider the different values of the parameter a priori (of sample observation) as equiprobable, then the likelihood function is proportional to the posterior (given the sample) probability of the values of the parameter.
• So that, in this case, the maximum likelihood estimate is the same as the maximum posterior probability estimate.
• In this case, then, while the likelihood is not the probability of a parameter value (it is proportional to it), to maximize the likelihood means to choose the parameter value which has the maximum probability given the sample.

19.109 Maximum Likelihood for Densities
• In the continuous case the interpretation is less straightforward. Here the likelihood function is the joint density of the observed sample as a function of the unknown parameter, and the estimate is computed by maximizing it.
• However, given that we are maximizing a joint density and not a joint probability, the simple interpretation just summarized is not directly available.

19.110 Example (Discrete Case)
Example of the two methods. Let X be distributed according to the Poisson distribution, that is: $P(x;\theta)=\frac{\theta^x e^{-\theta}}{x!}$, x = 0, 1, 2, ... Suppose we have a simple random sample of size n.

19.111 Example Method of Moments
• For this distribution all moments exist and, for instance, $E(X)=\theta$, $E(X^2)=\theta^2+\theta$.
• If we use the first moment for the estimation of θ we have $\hat\theta_1=\bar x$ but, if we choose the second moment, we have $\hat\theta_2=(-1+\sqrt{1+4\overline{x^2}})/2$, where $\overline{x^2}$ here indicates the empirical second moment (the average of the squares).

19.112 Example Maximum Likelihood
• The joint probability of a given Poisson sample is:
$$L_x(\theta)=\prod_i\frac{\theta^{x_i}e^{-\theta}}{x_i!}=\frac{\theta^{\sum_i x_i}e^{-n\theta}}{\prod_i x_i!}.$$
• For a given θ this probability does not depend on the specific values of each single observation but only on the sum of the observations and the product of the factorials of the observations.
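The two method of moments estimates of 19.111 can be compared on simulated data; a sketch (the Poisson sampler via Knuth's classical method, and all numbers, are illustrative):

```python
import random
from math import exp, sqrt

random.seed(2)

def poisson(lam):
    # Knuth's method for drawing one Poisson variate
    limit, count, prod = exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return count
        count += 1

theta = 3.0   # illustrative "true" parameter
x = [poisson(theta) for _ in range(4000)]

m1 = sum(x) / len(x)                      # first empirical moment
m2 = sum(xi * xi for xi in x) / len(x)    # second empirical moment

theta_1 = m1                              # invert E(X) = theta
theta_2 = (-1 + sqrt(1 + 4 * m2)) / 2     # invert E(X^2) = theta^2 + theta
print(theta_1, theta_2)                   # both near 3, but in general not equal
```

The two estimates are close to the true value but differ from each other, illustrating the first “problem” of section 19.102.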
• The value of θ which maximizes the likelihood is $\hat\theta_{ml}=\bar x$, which coincides with the method of moments estimate if we use the first moment as the function to invert.

19.113 More Advanced Topics
• Sampling standard deviation, confidence intervals, tests: a preliminary comment.
• The following topics are almost not touched in standard USA-like undergraduate Economics curricula, and only scantly so in other systems.
• They are, actually, very important, but only vague notions of these can be asked of a student as a prerequisite.
• In the following paragraphs such vague notions are shortly described.

19.114 Sampling Standard Deviation and Confidence Intervals
• As stated above, a point estimate is useless if it is not provided with some measure of sampling error.
• A common procedure is to report the point estimate together with some measure related to the sampling standard deviation.
• We say “related” because, in the vast majority of cases, the sampling standard deviation depends on unknown parameters, hence it can only be reported in an “estimated” version.

19.115 Sampling Variance of the Mean
• The simplest example is this.
• Suppose we have n iid observations from an unknown distribution about which we only know that it possesses expected value µ and variance σ² (by the way, are we considering here a parametric or a non parametric model?).
• In this setting we know that the arithmetic mean is an unbiased estimate of µ.
• By recourse to the usual properties of the variance operator we find that the variance of the arithmetic mean is σ²/n.
• If (as is very frequently the case) σ² is unknown, even after observing the sample we cannot give the value of the sampling standard deviation.

19.116 Estimation of the Sampling Variance
• We may estimate the numerator of the sampling variance, σ² (typically using the sample variance, with n or, better, n − 1 as a denominator), and we usually report the square root of the estimated sampling variance.
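The computation just described can be sketched on simulated data (the Gaussian distribution, its parameters and the seed are illustrative choices):

```python
import random
from math import sqrt

random.seed(3)
x = [random.gauss(10, 2) for _ in range(400)]   # illustrative iid sample
n = len(x)

mean = sum(x) / n
s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # sample variance, n - 1 denominator
se_hat = sqrt(s2 / n)                             # ESTIMATED sampling standard deviation
print(mean, se_hat)   # se_hat near the true value sigma/sqrt(n) = 2/20 = 0.1
```
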
• Remember: this is an estimate of the sampling standard deviation, hence it too is affected by sampling error (in widely used statistical software packages we invariably see the definition “standard deviation of the estimate” in place of “estimated standard deviation of the estimate”: this is not due to ignorance of the software authors, just to the need for brevity, but it could be misleading for less knowledgeable software users).

19.117 nσ Rules
• In order to give a direct joint picture of the estimate and its (estimated) standard deviation, “nσ rules” are often followed by practitioners.
• They typically report “intervals” of the form: Point Estimate ± n × Estimated Standard Deviation. A popular value of n outside Finance is 2; in Finance we see values of up to 6.
• A way of understanding this use is as follows: if we accept the two false premises that the estimate is equal to its expected value and this is equal to the unknown parameter, and that the sampling variance is the true variance of the estimate, then Tchebicev inequality assigns a probability of at least .75 to observations of the estimate, in other similar samples, falling inside the “±2σ” interval (more than .97 for the “±6σ” interval).

19.118 Confidence Intervals
• A slightly more refined, but theoretically much more demanding, behavior is that of computing “confidence intervals” for parameter estimates.
• The theory of confidence intervals typically developed in undergraduate courses of Statistics is quite scant.
• The proper definition is usually not even given, and only one or two simple examples are reported, with no precise statement of the required hypotheses.

19.119 Confidence Intervals
• These examples are usually derived in the context of simple random sampling (iid observations) from a Gaussian distribution, and confidence intervals for the unknown expected value are provided which are valid in the two cases of known and unknown variance.
• In the first case the formula is
$$\bar x\pm z_{1-\alpha/2}\,\sigma/\sqrt n$$
and in the second
$$\bar x\pm t_{n-1,1-\alpha/2}\,\hat\sigma/\sqrt n$$
where $z_{1-\alpha/2}$ is the quantile of the standard Gaussian distribution which leaves on its left a probability of 1 − α/2, and $t_{n-1,1-\alpha/2}$ is the analogous quantile for the T distribution with n − 1 degrees of freedom.

19.120 Confidence Intervals
• With the exception of the more specific choice of the “sigma multiplier”, these two intervals are very similar to the “rule of thumb” intervals we introduced above.
• In fact it turns out that, if α is equal to .05, the z in the first interval is equal to 1.96 and, for n greater than, say, 30, the t in the second formula is roughly 2.

19.121 Hypothesis testing
• The need of choosing actions when the consequences of these are only partly known is pervasive in any human endeavor. However, few fields display this need in such a simple and clear way as the field of finance.
• Consequently, almost the full set of normative tools of statistical decision theory has been applied to financial problems, and with considerable success, when used as normative tools (much less success, if any, was encountered by attempts to use such tools in the description of actual empirical human behavior. But this was to be expected).

19.122 Parametric Hypothesis
• Statistical hypothesis testing is a very specific and simple decision procedure. It is appropriate in some contexts, and the most important thing to learn, apart from technicalities, is the kind of context it is appropriate for.
• Statistical hypothesis. Here we consider only parametric hypotheses. Given a parametric model, a parametric hypothesis is simply the assumption that the parameter of interest θ lies in some subset $\Theta_i\subseteq\Theta$.
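The coverage statement behind the known-variance interval of section 19.119 can be checked by simulation; a minimal sketch (all parameters are illustrative):

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(4)
mu, sigma, n, alpha = 5.0, 2.0, 25, 0.05      # illustrative values
z = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96

covered, reps = 0, 2000
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    half = z * sigma / sqrt(n)
    covered += (xbar - half <= mu <= xbar + half)
print(covered / reps)   # close to 1 - alpha = 0.95
```

Across repeated samples the interval contains µ in about 95% of the cases, whatever the value of µ, which is exactly the defining property of the confidence interval.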
19.123 Two Hypotheses
• In standard hypothesis testing we confront two hypotheses of this kind (θ ∈ Θ0, θ ∈ Θ1), with the requirement that, wrt the parameter space, they should be exclusive (they cannot both be true at the same time) and exhaustive (they cover the full parameter space).
• So, for instance, if you are considering a Gaussian model and your two hypotheses are that the expected value is either 1 or 2, this means, implicitly, that no other values are allowed.

19.124 Simple and Composite
• A statistical hypothesis is called “simple” if it completely specifies the distribution of the observables; it is called “composite” if it specifies a set of possible distributions. The two hypotheses are termed the “null” hypothesis (H0) and the “alternative” hypothesis (H1).
• The reason for the names lies in the fact that, in the traditional setting where testing theory was developed, the “null” hypothesis corresponds to some conservative statement whose acceptance would not imply a change of behavior in the researcher, while the “alternative” hypothesis would have implied, if accepted, a change of behavior.

19.125 Example
• The simplest example is that of testing a new medicine or medical treatment.
• In a very stylized setting, let us suppose we are considering substituting an already established and reasonably working treatment for some illness with a new one.
• This is to be done on the basis of the observation of some clinical parameter in a population.
• We know enough to be able to state that the observed characteristic is distributed in a given way if the new treatment is not better than the old one, and in a different way otherwise.
• In this example the distribution under the hypothesis that the new treatment is not better than the old shall be taken as the null hypothesis, and the other as the alternative.
19.126 Critical Region, Acceptance Region
• The solution to a testing problem is a partition of the set of possible samples into two subsets. If the actually observed sample falls in the acceptance region (x ∈ A) we are going to accept the null; if it falls in the rejection or critical region (x ∈ C) we reject it.
• We assume that the union of the two regions covers the full set of possible samples (the sample space) while their intersection is empty (they are exclusive). This is similar to what is asked of the hypotheses wrt the parameter space, but has nothing to do with it.
• The critical region stands to testing theory in the same relation as the estimate stands to estimation theory.

19.127 Errors of First and Second Kind
• Two errors are possible:
1. x ∈ C but the true hypothesis is H0: this is called an error of the first kind;
2. x ∈ A but the true hypothesis is H1: this is called an error of the second kind.
• We should like to avoid these errors; however, obviously, we do not even know (except in toy situations) whether we are committing them, just like we do not know how wrong our point estimates are.
• Proceeding in a similar way as we did in estimation theory, we define some measures of error.

19.128 Power Function and Size of the Errors
• Power function and size of the two errors. Given a critical region C, for each θ ∈ Θ0 ∪ Θ1 (which sometimes, but not always, corresponds to the full parameter space Θ) we compute $\Pi_C(\theta)=P(x\in C;\theta)$, that is, the probability, as a function of θ, of observing a sample in the critical region, so that we reject H0.
• We would like, ideally, this function to be near 1 for θ ∈ Θ1, while we would like it to be near 0 for θ ∈ Θ0.
• We define $\alpha=\sup_{\theta\in\Theta_0}\Pi_C(\theta)$, the (maximum) size of the error of the first kind, and $\beta=\sup_{\theta\in\Theta_1}(1-\Pi_C(\theta))$, the (maximum) size of the error of the second kind.
19.129 Testing Strategy
• There are many reasonable possible requirements on the sizes of the two errors which we would like the critical region to satisfy.
• The choice made in standard testing theory is somewhat strange: we set α to an arbitrary (typically small) value and we try to find the critical region that, given that (or a smaller) size of the error of the first kind, minimizes (among the possible critical regions) the size of the error of the second kind.
• The reason for this choice is to be found in the traditional setting described above. If accepting the null means continuing some standard and reasonably successful therapy, it could be sensible to require a small probability of rejecting this hypothesis when it is true, while a possibly big error of the second kind could be considered acceptable.

19.130 Asymmetry
The reader should consider the fact that this very asymmetric setting is not the most common in applications.

19.131 Some Tests
• One sided hypotheses for the expected value in the Gaussian setting. Suppose we have an iid sample from a Gaussian random variable with expected value µ and standard deviation σ.
• We want to test H0: µ ≤ a against H1: µ ≥ b, where a ≤ b are two given real numbers. It is reasonable to expect that a critical region of the shape $C:\{\bar x:\bar x>k\}$ should be a good one.
• The problem is to find k.

19.132 Some Tests
• Suppose first that σ is known. The power function of this critical region is (we use the properties of the Gaussian under standardization):
$$\Pi_C(\mu)=P(\bar x\in C;\mu)=P(\bar x>k;\mu,\sigma)=1-P\left(\frac{\bar x-\mu}{\sigma/\sqrt n}\le\frac{k-\mu}{\sigma/\sqrt n}\right)=1-\Phi\left(\frac{k-\mu}{\sigma/\sqrt n}\right)$$
• Where Φ is the usual cumulative distribution function of the standard Gaussian distribution.

19.133 Some Tests
• Since $\frac{k-\mu}{\sigma/\sqrt n}$ is decreasing in µ, the power function is increasing in µ; hence its maximum value over the null hypothesis region is at µ = a.
• We want to set this maximum size of the error of the first kind to a given value α, so we want $1-\Phi\left(\frac{k-a}{\sigma/\sqrt n}\right)=\alpha$, so that $\frac{k-a}{\sigma/\sqrt n}=z_{1-\alpha}$, that is $k=a+\frac{\sigma}{\sqrt n}z_{1-\alpha}$.
• When the variance is unknown the critical region is of the same shape, but $k=a+\frac{\hat\sigma}{\sqrt n}t_{n-1,1-\alpha}$, where $\hat\sigma$ and t are as defined above.

19.134 Some Tests
The reader should solve the same problem when the hypotheses are reversed and compare the solutions.

19.135 Some Tests
• Two sided hypotheses for the expected value in the Gaussian setting and confidence intervals.
• By construction, the confidence interval for µ (with known variance)
$$\bar x\pm z_{1-\alpha/2}\,\sigma/\sqrt n$$
contains µ with probability (independent of µ) equal to 1 − α.
• Suppose we have H0: µ = µ0 and H1: µ ≠ µ0 for some given µ0. The above recalled property of the confidence interval implies that the probability with which $\bar x\pm z_{1-\alpha/2}\,\sigma/\sqrt n$ contains µ0, when H0 is true, is 1 − α.

19.136 Some Tests
• The critical region $C:\{\bar x:\mu_0\notin\bar x\pm z_{1-\alpha/2}\,\sigma/\sqrt n\}$ or, which is the same, $C:\{\bar x:\bar x\notin\mu_0\pm z_{1-\alpha/2}\,\sigma/\sqrt n\}$, has only probability α of rejecting H0 when H0 is true.
• Build the analogous region in the case of unknown variance, and consider the setting where you swap the hypotheses.

20 *Taylor formula in finance (not for the exam)

20.1 *Taylor's theorem
Let k ≥ 1 be an integer and let the function f: R → R be k times differentiable at the point a ∈ R. Then there exists a function $h_k:\mathbb R\to\mathbb R$ such that
$$f(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots+\frac{f^{(k)}(a)}{k!}(x-a)^k+h_k(x)(x-a)^k$$
and $\lim_{x\to a}h_k(x)=0$.
The last term in the formula is called the Peano form of the remainder. The Peano remainder only tells us that, if we define $P_k(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots+\frac{f^{(k)}(a)}{k!}(x-a)^k$, the Taylor polynomial of the function f at the point a, and $R_k(x)=f(x)-P_k(x)$, the remainder term, we have that $R_k(x)=o(|x-a|^k)$ for $x\to a$, that is, $\lim_{x\to a}\frac{R_k(x)}{|x-a|^k}=0$.
Notice that there is no reason to call $P_k(x)$ the “Asymptotic Best Fit” of a polynomial to f(x) over any interval centered in a and, in fact, no specific interval and no error measure to be minimized were defined in order to define the Taylor polynomial. There are several ways to fit polynomials to functions over intervals which give, in many senses, a “better fit” than the Taylor formula. The precise meaning of the Taylor formula shall be clearer if you study the proof of the formula sketched below.
Moreover, the result of Taylor's theorem only tells us something about the speed with which the remainder goes to 0 when we consider x → a. In fact the Peano form of the remainder tells us nothing about the size of the remainder for a given x not equal to a.

20.2 *Remainder term
In order to make more precise statements about the size of the remainder term over an interval of interest we need stronger hypotheses. Two famous results are as follows.
Lagrange form: let f: R → R be k + 1 times differentiable on the open interval and continuous on the closed interval between a and x. Then
$$R_k(x)=\frac{f^{(k+1)}(\xi_L)}{(k+1)!}(x-a)^{k+1}$$
for some real number $\xi_L$ between a and x.
Cauchy form:
$$R_k(x)=\frac{f^{(k+1)}(\xi_C)}{k!}(x-\xi_C)^k(x-a)$$
for some real number $\xi_C$ between a and x.
Notice that the error term is for a given x: $\xi_L$ shall change if you change x. However the result is still very useful, as it allows for bounding the remainder over any given interval by maximizing it over the interval (see the example below).

20.3 *Proof
There is a very nice proof of Taylor's theorem plus the Lagrange remainder which is purely algebraic (that is: it does not require limit statements and is only based on algebraic operations) once you suppose the required derivatives exist. In fact, this proof is a streamlined version of Lagrange's own proof.
Suppose you can write a function f(x + h) as $f(x+h)=a_0+a_1h+a_2h^2+\cdots$
+ aₙhⁿ + a_h h^{n+1}.

If f is bounded in an interval around x it is obviously always possible to get the equality, since a_h depends on h. Now fix some h* and the corresponding a_{h*}. The function

f(x + h) − (a₀ + a₁h + a₂h² + ... + aₙhⁿ + a_{h*}h^{n+1})

shall be equal to 0 if h = h* and, if we set a₀ = f(x), it shall be equal to 0 also when h = 0. However, if a function is equal to 0 in two points and is differentiable in between (an hypothesis we make), then the first derivative of the function must be 0 at some point, say h₀, between 0 and h* (this is Rolle's theorem, by Michel Rolle, France, 1652-1719, a contemporary of King Louis XIV, and it is a less obvious and much more powerful result than it seems). The first derivative of the function is

f′(x + h) − (a₁ + 2a₂h + ... + naₙh^{n−1} + (n + 1)a_{h*}hⁿ)

and if we set a₁ = f′(x) = f^{(1)}(x) this function is equal to 0 both at h = 0 and h = h₀. We can then repeat the argument: there must exist some point h₁ between 0 and h₀ where the derivative of this function is 0 and, if we set 2a₂ = f″(x) = f^{(2)}(x), the equality to 0 of the derivative is true also for h = 0. So there must exist a point between 0 and h₁ where the derivative of this derivative is zero... and so on. After repeating this n times we get

f^{(n)}(x + h) − (n!aₙ + (n + 1)!a_{h*}h).

There must be some point hₙ between 0 and h_{n−1} where this is 0 and, if we set aₙ = f^{(n)}(x)/n!, the function shall be equal to 0 also if h = 0. We take one more derivative and get

f^{(n+1)}(x + h) − (n + 1)!a_{h*}

which must be zero for some h_{n+1} between 0 and hₙ, so that

a_{h*} = f^{(n+1)}(x + h_{n+1})/(n + 1)!.

In the end we have found that, for any value of h, there exists an H_h (the h_{n+1} of the proof) between 0 and h such that

f(x + h) = f(x) + f^{(1)}(x)h + ... + f^{(n)}(x)/n!·hⁿ + f^{(n+1)}(x + H_h)/(n + 1)!·h^{n+1}

where the notation should help to stress the dependence of the "coefficient" of the last term, f^{(n+1)}(x + H_h)/(n + 1)!,
on the specific point h (one can avoid the dependence if and only if the function to be approximated is itself a polynomial of degree n + 1).

20.4 *Taylor formula and Taylor series

When the function f is infinitely differentiable around a we can legitimately consider P_k(x) for k → ∞ as the Taylor series generated by f. There is, however, no reason why this series in general should converge and, if it converges, why the convergence should be to f. In fact, when these two properties hold we say that f is a member of a very important class of functions: analytic functions. It is quite possible that the Taylor series generated by a given function (necessarily not analytic) is identical to the Taylor series of another function and, if this second function is analytic, then the series shall converge to it. A standard example where this happens is the function f(x) = e^{−1/x²} (extended with f(0) = 0). This is infinitely differentiable at x = 0 and any f^{(k)}(0) is equal to 0, so that the Taylor polynomial converges and is always zero on any interval. The remainder is, obviously, the function itself, which is o(|x − a|^k) for any k ≥ 0 because it goes to 0 for x → 0 faster than any power. As we see, Taylor's theorem is satisfied for any k and the Taylor series converges, but it converges to the null function. The null function is, obviously, analytic and its Taylor polynomials (not the remainders!) are all identical to those of f(x). Clearly the two functions are different if x ≠ 0.

20.5 *Taylor formula for functions of several variables

The general notation is quite cumbersome. Let |α| = α₁ + ··· + αₙ, α! = α₁!···αₙ!, x^α = x₁^{α₁}···xₙ^{αₙ} for α ∈ Nⁿ and x ∈ Rⁿ. If all the k-th order partial derivatives of f: Rⁿ → R are continuous at a ∈ Rⁿ, then one can change the order of mixed derivatives at a, so the notation

D^α f = ∂^{|α|}f/(∂x₁^{α₁}···∂xₙ^{αₙ}), |α| ≤ k,

for the higher order partial derivatives is not ambiguous.
The same is true if all the (k − 1)-th order partial derivatives of f exist in some neighborhood of a and are differentiable at a. In this case we say that f is k times differentiable at the point a.

Multivariate version of Taylor's theorem. Let f: Rⁿ → R be a k times differentiable function at the point a ∈ Rⁿ. Then there exist functions h_α: Rⁿ → R such that

f(x) = Σ_{|α|≤k} D^α f(a)/α!·(x − a)^α + Σ_{|α|=k} h_α(x)(x − a)^α    (1)

and

lim_{x→a} h_α(x) = 0.    (2)

The remainder term can be written in the Lagrange form as

Σ_{|α|=k+1} D^α f(ξ_L)/α!·(x − a)^α

where ξ_L = a + (x − a)c_L and c_L is a scalar with c_L ∈ (0, 1). In words: ξ_L is a point on the segment from a in the direction of x, at a fraction c_L of the distance between a and x. This is the multidimensional analogue of the sentence "for some real number ξ_L between a and x" used in defining remainders in the one dimensional case.

If you suppose n = 2 and k = 2, the Taylor formula amounts to:

f(x₁, x₂) = f(a₁, a₂) + f_{x₁}(a₁, a₂)(x₁ − a₁) + f_{x₂}(a₁, a₂)(x₂ − a₂) +
+ f_{x₁x₂}(a₁, a₂)(x₁ − a₁)(x₂ − a₂) + ½f_{x₁x₁}(a₁, a₂)(x₁ − a₁)² + ½f_{x₂x₂}(a₁, a₂)(x₂ − a₂)² +
+ h_{1,1}(x₁, x₂)(x₁ − a₁)(x₂ − a₂) + h_{2,0}(x₁, x₂)(x₁ − a₁)² + h_{0,2}(x₁, x₂)(x₂ − a₂)².

20.6 *Simple examples of Taylor formula and Taylor theorem in quantitative Economics and Finance

We can do beautiful and easy things with polynomials which we cannot do so easily with other functions. For instance: it is easy to differentiate or integrate polynomials, and the results of these operations are again polynomials. It is easy to find expected values of polynomials, as these only involve moments. Sums, differences and products of polynomials are still polynomials. The "taking the ratio" operation requires some more precision: polynomials are a special case of a larger class of functions called "rational functions". Rational functions are functions which can be written as ratios of polynomials (if the denominator is 1 we are back with polynomials).
Rational functions are closed under the operations listed above and also under ratios; moreover their integrals and derivatives are easy to compute.

Finally, two polynomials are equal if and only if they have identical coefficients for terms with the same power.

There is also a famous result, the Stone-Weierstrass theorem, that tells us this: for any continuous function f(x) on a closed interval [a, b] and any number ε > 0 there exists a polynomial g_{f,ε}(x) such that sup_{x∈[a,b]} |f(x) − g_{f,ε}(x)| < ε. That is: we can approximate the given function with a polynomial over an interval with maximum error ε. We can actually build such a polynomial (if we know how to compute f(x)) using an interesting tool, borderline between Probability and Analysis, called the "Bernstein polynomial". This result implies that polynomials are, in a precise (and to be understood precisely!) sense, all we need if we are modeling phenomena using continuous functions on bounded intervals.

It is then enticing, when we do algebra with functions, that is: when we try to figure out the implications of our mathematical models, to try and approximate the functions of interest with polynomials before acting on them. The problem here is that, even if the initial approximation is good, it may well be that even simple operations on the functions amplify the size of the error. For instance, the second order approximations near x = 0 of the polynomials x + x² + .00001x³ and x + x² are identical: both are x + x². For the second polynomial this is perfect, and for the first it is very good if x is not too far from 0. However, if we take even the simple difference of the exact functions and of their approximations, we find two different results (.00001x³ and 0). Not much but suppose this difference, further on in our modeling, is multiplied by a big number, or maybe is part of a formula where you are interested in some limit for x going to infinity. The possible problems are clear.
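The Bernstein construction mentioned above can be sketched in a few lines. This is an illustrative implementation on [0, 1] (the test function |x − 1/2| and the grid are arbitrary choices, not from the notes): the n-th Bernstein polynomial is B_n(f)(x) = Σ_{k=0}^{n} f(k/n)·C(n,k)·x^k·(1 − x)^{n−k}, and its sup error shrinks as n grows.

```python
from math import comb

def bernstein(f, n, x):
    # n-th Bernstein polynomial of f on [0, 1], evaluated at x
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)               # continuous, not differentiable at 1/2
grid = [i / 200 for i in range(201)]
errs = []
for n in (10, 50, 250):
    errs.append(max(abs(f(x) - bernstein(f, n, x)) for x in grid))
print(errs)                               # sup error decreases with n
```

Note that the convergence is slow (roughly like 1/√n for this kinked function); Bernstein polynomials are valued for their simplicity and probabilistic interpretation, not for speed.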
In the following examples we shall consider some simple financial applications of polynomial approximations based on the Taylor formula. Many other polynomial based approximation methods exist (we mentioned Bernstein polynomials; another very important approximation toolbox in applied Mathematics is the one based on orthogonal polynomials). The Reader should be warned that this is a very tricky job: approximations may work for some purposes and dramatically fail for others. Conditions and rules for a cautious use of the trick exist, and a detailed study of these is required before any independent foray into the use of polynomial approximations in general and of Taylor approximations in particular.

20.7 *Linear and log returns, a reconsideration

As discussed at the beginning of these notes, in finance the evolution of security values, for instance in the case of stock prices, is often modeled in terms of returns, not of prices. We do not directly model the evolution of prices over time (which is, obviously, our main interest) but the return process and, from this, if necessary, we derive price behavior. The reason for this rather peculiar attitude is partially similar to the reason why physical models of motion are usually written in terms of acceleration and not of speed or position: the Mathematics is simpler. In simple physical models accelerations are proportional to forces (Newton's second law, the proportion being given by the reciprocal of mass) and forces "add" in a simple way; in finance, returns, properly defined, still "add in a simple way" (and the analogy ends here). Moreover, as a first approximation (here not in the Taylor sense!) it is an empirical fact that returns (again: properly defined) can be considered as statistically identically distributed and independent over time, while this is not true for price levels and price differences.
For price levels this is obvious; for price differences it is another empirical fact that, as a rule, their variance depends on the level of the price. It is much easier to start by modeling independent random variables and then derive models for variables which are functions of these: hence the choice of returns.

The problem is what the "right" definition of return is. Here we have to optimize a trade off between the "natural and intuitive" definition of return, which existed well before quantitative finance, and a definition of return easier to work with in mathematical terms (the "properly defined" proviso above). When we wish to define return we must first decide whether to take into account only the price evolution or to also consider, for instance, dividends and similar actions on the firm's capital for a stock, or coupons for a bond (here we shall stick to the stock case as an example). Secondly, we need to choose a particular formula for the return definition.

Suppose we are interested in a time period between t and t + 1. At time t the price of a share of stock is P_t and at time t + 1 it is P_{t+1}. Moreover, between the two dates some dividend was distributed: let us call this D_t (we are supposing that the dividend is known at t but is distributed at t + 1 with no accrual, so that P_t is the "cum dividend" price). A very simple definition of "percentage price return" is r_{t+1} = P_{t+1}/P_t − 1, while a very simple definition of "percentage total return" is R_{t+1} = (P_{t+1} + D_t)/P_t − 1. It is clear that, from the point of view of the "financial meaning" of the result, while the first formula is the one most often used in financial newspapers, the second is the more appropriate to express the "percentage gain" from holding the share between t and t + 1. This is a very simple definition with a very simple interpretation in terms of percentage gain. It is not always a very useful definition if we want to model returns.
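A one-line numerical illustration of the two definitions (the prices and the dividend below are hypothetical):

```python
# Hypothetical data: price 100 at t, price 104 at t+1, dividend 2 paid at t+1
P_t, P_t1, D_t = 100.0, 104.0, 2.0

r = P_t1 / P_t - 1              # percentage price return
R = (P_t1 + D_t) / P_t - 1      # percentage total return
print(r, R)                     # 0.04 and 0.06: the dividend adds 2% of gain
```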
The problem is that simple percentage (sometimes called "linear") returns (price only or total) do not add over time. In fact, with price returns,

r_{t+2,t} = P_{t+2}/P_t − 1 = (P_{t+2}/P_{t+1})·(P_{t+1}/P_t) − 1 = (1 + r_{t+2})(1 + r_{t+1}) − 1

or, better, in terms of "gross returns", 1 + r_{t+2,t} = (1 + r_{t+2})(1 + r_{t+1}), that is: P_{t+2}/P_t = (P_{t+2}/P_{t+1})·(P_{t+1}/P_t).

With total returns the mess may be even worse, depending on how we deal with period dividends in defining multi period returns: do we simply add them or consider accruals? According to the first possibility we get

1 + R_{t+2,t} = (P_{t+2} + D_{t+1} + D_t)/P_t = (P_{t+2} + D_{t+1})/P_t + D_t/P_t =

= ((P_{t+2} + D_{t+1})/P_{t+1})·((P_{t+1} + D_t − D_t)/P_t) + D_t/P_t =

= (1 + R_{t+2})(1 + R_{t+1}) − (1 + R_{t+2})·D_t/P_t + D_t/P_t = (1 + R_{t+2})(1 + R_{t+1}) − R_{t+2}·D_t/P_t.

If, instead, we define 1 + R_{t+2,t} = (P_{t+2} + D_{t+1} + D_t(1 + R_{t+2}))/P_t, we get 1 + R_{t+2,t} = (1 + R_{t+2})(1 + R_{t+1}). A possible defense, beyond simple convenience, of this "with accruals" definition is that its difference from the no accrual one is just the term R_{t+2}·D_t/P_t, which is likely to be small. Beware! This is going to be true only for short time spans.

The mathematical solution of the additivity problem, when considering gross price returns or total returns with accrual, is immediate: take the (natural) logarithm. With r*_{t+2,t} = ln(1 + r_{t+2,t}) we get r*_{t+2,t} = r*_{t+2} + r*_{t+1}.

What happens with total returns? The problem is in the definition of total return over many time periods. If we define total return over one time period as we did,

R*_{t+1} = ln((P_{t+1} + D_t)/P_t),

for total return over two time periods we could define

R*_{t;t+2} = ln((P_{t+2} + D_{t+1} + D_t)/P_t) ≠ ln((P_{t+2} + D_{t+1})/P_{t+1}) + ln((P_{t+1} + D_t)/P_t)

so that no simple aggregation holds. However it is clear that, in this definition, we do not consider the dividend as reinvested. Let us suppose that dividend D_t is paid at time t + 1 (and the same for the other dividends).
Between time t + 1 and t + 2 this dividend is reinvested in the same stock, so that at time t + 2 its value is D_t(P_{t+2} + D_{t+1})/P_{t+1}. Keeping this in mind we define

R*_{t;t+2} = ln((P_{t+2} + D_{t+1} + D_t(P_{t+2} + D_{t+1})/P_{t+1})/P_t) = ln(((P_{t+2} + D_{t+1})P_{t+1} + D_t(P_{t+2} + D_{t+1}))/(P_t·P_{t+1})) =

= ln((P_{t+2} + D_{t+1})(P_{t+1} + D_t)/(P_{t+1}·P_t)) = ln((P_{t+2} + D_{t+1})/P_{t+1}) + ln((P_{t+1} + D_t)/P_t).

Hence, if we take into account dividend reinvestment according to this convention (others exist, but only this one gives the required result), time additivity holds not only for simple (price) log returns but also for total log returns. It is to be noticed that this formula requires the knowledge of P_{t+1} to compute the two period return, while the no compounding formula does not. This means that we can compute time additive total log returns but, in order to do so, we need intermediate prices or, at least, capitalized dividends. In practice, multi period total log returns can only be computed by adding single period returns while, in the simple log return case, it is possible to compute many period log returns by simply knowing the initial and terminal price.

20.8 *Taylor theorem and the connection between linear and log returns

Now Taylor's theorem. Can it help us in assessing how much we are wrong if we take log returns in the place of linear returns, and if we suppose that log total returns are the sum of log price returns and the log dividend price ratio?

Let us start with log price returns: r*_{t+1} = ln(P_{t+1}/P_t). If we suppose that the ratio of the two prices is near 1 (that is: if we allow for a short time span between the two prices and suppose no significant new information arrives in the interval) we can approximate the log return around x = 1, where x = P_{t+1}/P_t. It is obvious that, in any open interval including x = 1 and not including x = 0, ln x is differentiable any number of times, so that we can define its Taylor series as:

ln x = 0 + (x − 1) − (x − 1)²/2 + (x − 1)³/3 − ...
If we truncate this expansion to the first order we get ln x ≈ x − 1, so that r*_{t+1} ≈ r_{t+1}. What happens for total returns? It is clear that the above argument still holds, simply taking x = (P_{t+1} + D_t)/P_t.

In both cases, as we saw in the first chapters of these handouts, since x − 1 is tangent to ln x at x = 1 (the two functions have the same value 0 and the same derivative 1 at that point) and the second derivative of ln x is always negative, we have x − 1 ≥ ln x, so that the linear return shall never be smaller than the log return.

20.9 *How big is the error?

This depends on how far x is from 1, that is: how far the price (or price plus dividend) moved from the past price. According to the Lagrange remainder formula we have R_k(x) = f^{(k+1)}(ξ_L)/(k + 1)!·(x − a)^{k+1} which, in our case, becomes

R₁(x) = −(x − 1)²/(2ξ_L²).

We see at once that the error is always negative, that is: the log return cannot be bigger than the linear return. Moreover, it is always going to behave (locally) as a parabola and, for x in any interval 0 < α ≤ x ≤ β, it shall be bounded in modulus by (x − 1)²/(2·min(α, 1)²) (obviously we can compute it to any precision for any value of x, but the bound is still useful for a clear understanding of the approximation properties). To have an idea: for a change of ±10% in price, the bound just given says that the error shall be smaller in absolute value than (.1)²/(2·(.9)²) = .01/1.62 ≈ .00617. The actual error for a 10% change in price is −.00536 when the change is downward; for a 10% increase in price the error is −.00469.

In the end, is the error negligible? It obviously depends on the likelihood of big changes in price (price plus dividend) in the time interval under consideration and on the precision we want to achieve with our model.

20.10 *Gordon model and Campbell-Shiller approximation

Let us now consider a related important topic.
If we do simple algebra we get

R_{t+1} = (P_{t+1} − P_t + D_t)/P_t = (ΔP_{t+1} + D_t)/P_t = r_{t+1} + D_t/P_t,

that is: the total return is the price return plus the dividend to price ratio. In other words, D_t/P_t = R_{t+1} − r_{t+1}. This, ex post, is an identity, and we can take the expected value conditional on t and get D_t/P_t = E_t(R_{t+1} − r_{t+1}). Still an identity, not a model. That is: these relations are always true and put no constraint on the observable variables. In fact this identity is a little bit of a cheat since, conditional on t, the identity D_t/P_t = R_{t+1} − r_{t+1} implies that R_{t+1} − r_{t+1} is non stochastic as, obviously, the difference does not contain the only stochastic (given t) element of the t + 1 returns, that is: P_{t+1}. However, if we do not condition on t, this becomes a relevant identity, because it says that, whatever the model, the dividend price ratio at time t (now a random variable, as we are not conditioning on t any more) is identical to the (now random) conditional on t expected value of future excess returns, so that any model for expectations of future excess returns is a model for the dividend price ratio. Any change in expectations of future excess returns must imply a change in the dividend price ratio.

Let us create a model. The simplest example is the Gordon model. This is a very old idea which is still the groundwork for much "self understanding" on the part of companies and investors in the market. The starting point is the idea that the market price of a company (to be simple, suppose this is equal to the price of its stock) must be the same as the flow of its future dividends, discounted using as discount rate the cost of capital r (supposed constant) for the company. In order to get a simple formula we shall suppose that the cost of capital is constant and that future dividends grow from the current dividend at a constant percentage rate g. We have then

P_t = Σ_{i=1}^∞ D_t(1 + g)^{i−1}/(1 + r)^i

(here we implicitly suppose that the first dividend D_t is paid at t + 1).
From this we get (using the standard result for power series, Σ_{i=1}^∞ θ^i = θ/(1 − θ) for 0 ≤ θ < 1, and supposing g < r)

D_t/P_t = r − g.

Recalling now our previous result, this implies that the expected excess return is roughly a constant, E_t(R_{t+1} − r_{t+1}) = r − g, given by the difference between the cost of capital and the rate of growth of dividends. Constant, because in the Gordon model we suppose these two parameters to be constant.

The result or, more properly, the hypothesis, is surprisingly powerful if taken seriously. For instance, suppose you have an hypothesis about dividends and know the price of the company: the formula gives you a way of computing the cost of capital compatible with the current price and the hypothesis on dividends so that, if you (or the company) wish to enter some investment at some cost, you may think it reasonable to compare the dividends you expect from the investment with its cost, in order to guess whether this shall increase or decrease the price (value) of the company.

A second implication is that, in the absence of structural change, the dividend price ratio should be almost constant, so that if you can forecast dividends you can also forecast prices. While this is a simple rewording of the hypotheses of the model, this rather obvious conclusion (we repeat: to be honest, it is actually an hypothesis) is the implicit basis of much Corporate Finance "folk" reasoning.

Another simple rewording of the hypothesis is the common notion that dividend growth cannot be bigger than the cost of capital (at least not for an indefinite amount of time). Formally this is nothing but the condition for the convergence of the power series, but it has an interesting interpretation when we think of "bubbles" which, in this simple interpretation, could simply come from the idea that, for some "new kind of company" in some "new kind of world", dividend growth shall always be bigger than the cost of capital. This is a corporate equivalent of changing lead into gold.
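The Gordon formula can be checked numerically by truncating the dividend sum at a large horizon (the values of r, g and D below are hypothetical, chosen only for illustration):

```python
# Numerical check of the Gordon formula D_t / P_t = r - g
# Hypothetical numbers: cost of capital 8%, dividend growth 3%, dividend 2
r, g, D = 0.08, 0.03, 2.0

# price as the discounted dividend flow, truncated at a large horizon
P = sum(D * (1 + g)**(i - 1) / (1 + r)**i for i in range(1, 5000))

print(P, D / P)   # P is close to D/(r-g) = 40, so D/P is close to 0.05 = r - g
```

The truncation error is negligible here because ((1+g)/(1+r))^n decays geometrically.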
In the related academic literature the condition for the convergence of the power series takes, not surprisingly, the name of "condition for the absence of rational bubbles". But beware! In Economics, sometimes gold becomes lead and lead gold, and this with no need of atomic reactions: time and history suffice. Sometimes gold becomes lead: think, for instance, of the history of the economic value of salt. Today you put it in dishwashing machines; not so long ago wars were fought over it. Sometimes lead becomes gold: think about oil, for instance. It used to be a big nuisance, as it fouled otherwise useful farming fields; now things are quite different, and countries like the USA are purposely destroying water reserves to recover it from shales. In fact, each time markets are in a bubble, examples like these are quoted both by the party of reason and by the party of exuberance.

Now let us move a step further. The Gordon model is written in terms of linear returns. What happens if we start with log total returns? The first possible line of attack is to say that, since R_{t+1} = r_{t+1} + D_t/P_t, something like R*_{t+1} ≈ r*_{t+1} + ln(D_t/P_t) should be true. Here things get tricky: this formula cannot work, even approximately. At a glance we see that, while a value near one for the ratio of consecutive prices is reasonable, so that something like r_{t+1} ≈ r*_{t+1} may be true, x and ln x are always very different when x is the dividend price ratio. We are sure that our x cannot be negative, but it shall be much smaller than 1 (except in very exceptional cases), so that its logarithm shall always be a negative number, large in absolute value. We must be careful.
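A numerical sketch of why the naive formula fails (the prices and dividend are hypothetical): the log total return stays close to the log price return, while adding ln(D_t/P_t) throws in a large negative number.

```python
import math

# Hypothetical one period data: price 100 -> 104, dividend 2 paid at t+1
P_t, P_t1, D = 100.0, 104.0, 2.0

R_star = math.log((P_t1 + D) / P_t)       # log total return
r_star = math.log(P_t1 / P_t)             # log price return
naive  = r_star + math.log(D / P_t)       # the "naive" sum

print(R_star, r_star, naive)
# R* and r* differ by about 0.02, while ln(D/P) = ln(0.02) is about -3.9,
# so the naive sum is wildly off
```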
In a widely quoted paper by Campbell and Shiller ("The dividend price ratio and expectations of future dividends and discount factors", Review of Financial Studies, 1988) we find the following result:

R*_{t+1} = ln((P_{t+1} + D_t)/P_t) ≈ k − ρ·ln(D_t/P_{t+1}) + ln(D_{t−1}/P_t) + ln(D_t/D_{t−1}) = k − ρδ_{t+1} + δ_t + Δd_t

where δ_{t+1} = ln(D_t/P_{t+1}), δ_t = ln(D_{t−1}/P_t) and Δd_t = d_t − d_{t−1}. Notice that, while different from the naive approximation above, also in this case we have logarithms of ratios which could be near or equal to zero. The Authors do not give a rigorous proof of this approximation: they state that, in order to get it, one must use the Taylor formula to the first order for the total log return written as a function of δ_{t+1}, δ_t and Δd_t, where the expansion is computed at the point δ_{t+1} = δ_t = δ and Δd_t = g (a constant). This heuristic derivation does not allow for any assessment of the quality of the approximation and, in fact, the Authors only present a numerical study of the approximation itself. In fact it is not difficult to provide a rigorous derivation.

First let us write the log total return as a function of the required variables:

ln((P_{t+1} + D_t)/P_t) = ln(P_{t+1} + D_t) − ln P_t = ln(P_{t+1}(1 + D_t/P_{t+1})) − p_t =

= ln(1 + D_t/P_{t+1}) + p_{t+1} − p_t = ln(1 + e^{δ_{t+1}}) + p_{t+1} − p_t.

Notice that, simply by writing δ_{t+1} = ln(D_t/P_{t+1}), we suppose D_t > 0; this is important for what follows. Now let us expand ln(1 + e^{δ_{t+1}}) using the Taylor formula to the first order at the point δ_{t+1} = δ:

ln(1 + e^{δ_{t+1}}) ≈ ln(1 + e^δ) + e^δ/(1 + e^δ)·(δ_{t+1} − δ)

so that

R*_{t+1} ≈ ln(1 + e^δ) + e^δ/(1 + e^δ)·(δ_{t+1} − δ) + p_{t+1} − p_t =

= ln(1 + e^δ) + e^δ/(1 + e^δ)·(δ_{t+1} − δ) + p_{t+1} + d_t − d_t + d_{t−1} − d_{t−1} − p_t =

= ln(1 + e^δ) − e^δ/(1 + e^δ)·δ + e^δ/(1 + e^δ)·δ_{t+1} − δ_{t+1} + δ_t + d_t − d_{t−1} =

= ln(1 + e^δ) − e^δ/(1 + e^δ)·δ − (1 − e^δ/(1 + e^δ))·δ_{t+1} + δ_t + d_t − d_{t−1}

which is of the required form with k = ln(1 + e^δ) − e^δ/(1 + e^δ)·δ and ρ = 1/(1 + e^δ). Remember that the approximation only involves the ln(1 + e^{δ_{t+1}}) term; the rest of the formula is exact.
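The derivation above can be verified numerically. In this sketch all prices and dividends are hypothetical, and the expansion point δ is set to the log of a "typical" dividend price ratio of 2%:

```python
import math

# Hypothetical data: prices 100 -> 103, dividends 1.9 then 2
P_t, P_t1 = 100.0, 103.0
D_tm1, D_t = 1.9, 2.0

delta_t1 = math.log(D_t / P_t1)      # log dividend price ratio at t+1
delta_t = math.log(D_tm1 / P_t)      # log dividend price ratio at t

delta = math.log(0.02)               # expansion point (assumed "typical" level)
rho = 1 / (1 + math.exp(delta))
k = math.log(1 + math.exp(delta)) - (1 - rho) * delta

exact = math.log((P_t1 + D_t) / P_t)                            # log total return
approx = k - rho * delta_t1 + delta_t + math.log(D_t / D_tm1)   # loglinearization
print(exact, approx)   # the two values agree to several decimal places
```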
Recalling the hope of writing something like R*_{t+1} ≈ r*_{t+1} + ln(D_t/P_t), we can slightly modify the argument in the following way:

R*_{t+1} = ln(P_{t+1}/P_t + D_t/P_t) = ln(e^{r*_{t+1}} + e^{ln(D_t/P_t)}).

If we now expand this around r*_{t+1} = 0 and ln(D_t/P_t) = δ we get

R*_{t+1} ≈ ln(1 + e^δ) + 1/(1 + e^δ)·r*_{t+1} + e^δ/(1 + e^δ)·(ln(D_t/P_t) − δ)

which is of the form

R*_{t+1} ≈ a + b·r*_{t+1} + c·ln(D_t/P_t).

This is as near as we can go to the naive and wrong idea R*_{t+1} ≈ r*_{t+1} + ln(D_t/P_t) and, yes, while we can expect a very near to 0 and b very near to 1, we can also expect c very near to 0 (why all this?), so that the naive idea is VERY wrong. Notice that in the very unlikely case when dividends are expected to be HUGE with respect to P_t, that is, similar in value to P_t, we have ln(D_t/P_t) ≈ 0 but, in this case, a ≈ ln 2 − 1/2 ≈ .193 and b ≈ 1/2.

Notice also that both this approximation and the original Campbell and Shiller approximation depend on the hypothesis of non null dividends. In fact they break down if dividends are very near 0 (and this is quite a possibility); in the case of 0 dividends the approximations are not even defined. For instance, in this case we should have R*_{t+1} = r*_{t+1}, while the approximation R*_{t+1} ≈ a + b·r*_{t+1} + c·ln(D_t/P_t) is going to give us a very negative and meaningless number. On the other hand, the Campbell and Shiller approximation shall give either meaningless negative or positive values depending on which between δ_t and δ_{t+1} corresponds to almost null dividends (and if they both do...). For this reason the expansion should not be used over short time periods (small probability of a dividend payment) and it is better used for stock portfolios than for single stocks (at least some stock in the portfolio should yield dividends in each period, if the time period is not too short). Obviously, always recall that log returns do not add across different stocks.

Let us now go back to the Campbell and Shiller approximation.
Notice that k and ρ depend on the point around which we expand the log return using Taylor. This implies that, in what follows, if we intend to use the formula by, for instance, iterating it or summing it over different times, we must assume that the expansion point is constant and, in particular, does not depend on δ_t. Keep this in mind in order to understand what follows.

Above we showed that log total returns are additive over time if we consider reinvestment of the dividends in the stock. We are now approximating log returns, and we could hope that, while an approximation, the Taylor expansion is additive, that is: that the Taylor approximation of the many period log total return is identical to the sum of the Taylor approximations of the one period log total returns. This is indeed true, as the two period log total return with reinvested dividends is equal to the sum of the two one period log total returns, and the two variable Taylor formula at the first order does not involve mixed derivatives. However we need some care with the expansion point. Our one period return expansion is in terms of δ_{t+1}; we now have δ_{t+1} and δ_{t+2}, that is, a function of two variables. In order to compare the approximations we need to expand the two variable function around a point with identical coordinates, let us say δ.

Let us begin with the sum of the two expansions for the two one period returns:

ln((P_{t+2} + D_{t+1})/P_{t+1}) + ln((P_{t+1} + D_t)/P_t) ≈ (k − ρδ_{t+2} + δ_{t+1} + d_{t+1} − d_t) + (k − ρδ_{t+1} + δ_t + d_t − d_{t−1}) =

= 2k − ρ(δ_{t+2} + δ_{t+1}) + δ_{t+1} + δ_t + d_{t+1} − d_{t−1}

where 2k = ln((1 + e^δ)²) − 2δ·e^δ/(1 + e^δ) and ρ = 1/(1 + e^δ). This is exactly what we would get by writing

R*_{t;t+2} = ln((1 + e^{δ_{t+2}})(1 + e^{δ_{t+1}})) + p_{t+2} − p_t

and expanding the log part to the first order around the point of coordinates [δ_{t+2} = δ; δ_{t+1} = δ].
In the end: if we use the convention of reinvesting dividends in the stock, and suppose we do this on a period by period basis, log total returns are additive over time and the Taylor expansion keeps this linearity property. This is quite expedient from the point of view of simplicity. Remember, however, that, as already mentioned, if we define returns in this way and then proceed to build models for returns, we must fully understand that returns so defined are not the original ratios of quantities. They are no longer percentages and, overall, may have very different properties. For instance: percentage returns have a lower bound (−1, that is: −100%) while log returns have no lower bound, so that the Gaussian distribution cannot be a satisfactory model for percentage returns while it can be a good model for log returns.

20.11 *Remainder term

We are now in the position of computing the remainder term which, in the Lagrange form, shall be

(1/2!)·(∂²ln(1 + e^{δ_{t+1}})/∂δ_{t+1}²)|_{δ_{t+1}=δ_L}·(δ_{t+1} − δ)² = e^{δ_L}/(2(1 + e^{δ_L})²)·(δ_{t+1} − δ)² ≤ (1/8)·(δ_{t+1} − δ)²

(the maximum of x/(1 + x)² is at x = 1 and is equal to 1/4). With an error of this size we are able to "take out linearly" δ_{t+1} from the logarithm. Notice that no other approximation is made in the formula for R*_{t+1}: the rest of the result only depends on adding and subtracting d_t and d_{t−1} and reordering terms.

Again: the error is always positive, so the approximation always underestimates the true total return. The size of the error depends in a quadratic way on the distance of the log dividend price ratio from the constant δ. If the log dividend price ratio is a stationary process with small variance, and we choose δ = E(δ_t) (obviously, in view of stationarity, this does not depend on t), the approximation is going to work very well.
In fact, in this case, the expected value of the maximum error shall be:

E((1/8)(δ_{t+1} − δ)²) = E((1/8)(δ_{t+1} − E(δ_{t+1}))²) = (1/8)·Var(δ_{t+1}).

However beware: this is true for each single step of the approximation; if we use the approximation for computing many terms and then sum them, we are going to run into problems. In fact, since the error has the same sign for each single term, any summation is going to increase its overall size. Hence: be careful if you see (as shall be the case) this approximation used in any summation or series. It may work, but care is needed.

20.12 *Dividend price model

A common use of the approximation is as follows. Suppose the approximation is actually exact, so that

R*_{t+1} = k − ρδ_{t+1} + δ_t + Δd_t

(with Δd_t = d_t − d_{t−1}). If we forget everything about the approximations (errors, and constants which are really not constants) this may be taken as a linear difference equation for δ_t:

δ_t = ρδ_{t+1} − k − Δd_t + R*_{t+1}

which can be solved iteratively forward in time (beware! recall the analysis we did a moment ago about the repeated use of the approximate identity) as

δ_t = ρ^{m+1}δ_{t+m+1} + Σ_{j=0}^m ρ^j(R*_{t+j+1} − Δd_{t+j} − k).

If we suppose that the summation converges (in some sense) as m goes to infinity, we have

δ_t = Σ_{j=0}^∞ ρ^j(R*_{t+j+1} − Δd_{t+j}) − k/(1 − ρ)

(the constant term is the geometric sum k·Σ_{j=0}^∞ ρ^j = k/(1 − ρ)). Notice that, given the properties of the approximation error, such a convergence would require, at the very least, the approximation error to go to 0 (we wrote the approximation supposing it equal to 0), but this would be possible only if the log dividend price ratio became a constant equal to δ. Now, put this worry in the back of your mind and take the expected value conditional on t of the two sides of this equation.
We get

$$\delta_t = \sum_{j=0}^{\infty}\rho^j E_t(R^*_{t+j+1} - \Delta d_{t+j}) - \frac{k}{1-\rho}$$

As before, this formula, apart from the approximation, is still an identity. But if we use any model according to which the current log dividend price ratio is, apart from a constant, given by the sum of the discounted expected values (corrected for risk) of the future log dividend changes, we must accept that the proper discount rate is $h = 1/\rho - 1$ (so that the discount factor $1/(1+h)$ is equal to $\rho$) and the risk premium is given by $E_t(R^*_{t+j+1})$. In other words, apart from the approximation error, this "identity" imposes constraints on the modeling of the price dividend ratio, dividend growth and returns: roughly, to model any two of these implies a model of the third. A very imprecise but evocative way to read this statement is that you should, for instance, be able to forecast, at least partially, future returns if you have a model of dividend growth and you observe the price dividend ratio. Moreover, if we assume $E_t(R^*_{t+j+1} - \Delta d_{t+j}) = q$, some constant, we are back to the Gordon model, now in log return form (at least, this is what the Authors say, but some care should be taken in dealing with $\rho$ and $k$). Something more on the approximation. We should keep in mind that the linearity of the "difference" equation fully comes from the approximation and strictly requires that $\rho$ be a constant. This is true only if we suppose the expansion point is always the same, so that, when we iterate the equation, we are going to incur different approximation errors for each iterate, depending on how far the actual $\delta_{t+1}$ is from the expansion point $\delta$.

20.13 *What happens if we take the remainder seriously

As an exercise in the critical analysis of a Taylor formula application, let us think a little bit about the consequences of the approximation. Two points must be kept in mind. The first is that a Taylor approximation is a local approximation.
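It may help to check the forward iteration numerically. In the sketch below (all numerical values are hypothetical) we build $R^*$ so that the one-step relation holds exactly; the iterated formula then returns the current $\delta$ exactly, by telescoping:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, k, M = 0.96, 0.2, 50                        # hypothetical values for the illustration
delta = -3.0 + 0.3 * rng.standard_normal(M + 1)  # an arbitrary path of log d-p ratios
dd = 0.02 + 0.1 * rng.standard_normal(M)         # arbitrary log dividend growth
# define R* so that delta_t = rho*delta_{t+1} - k - dd_t + R*_{t+1} holds exactly
Rstar = delta[:-1] - rho * delta[1:] + k + dd
# the forward-iterated formula must then give back delta[0]
iterated = rho ** M * delta[M] + sum(
    rho ** j * (Rstar[j] - dd[j] - k) for j in range(M))
assert abs(iterated - delta[0]) < 1e-9
```

With the true (approximate) relation the telescoping is contaminated, at every step, by the always-positive remainder, which is exactly the problem discussed in the text.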
That is: it is only based on the value of a function and of its derivatives at a point. The second is that the Taylor approximation is a polynomial approximation. Whatever the function to approximate, the approximation is made with a polynomial. This implies that, the more the function and its derivatives change in a given interval, the less the approximation shall work (think, for instance, about the approximation of $\sin(1/x)$ for $x > 0$ but not so far from it). Moreover, some properties of polynomials can be very different from the properties of the underlying function. For instance: $\sin(x)$ is bounded for any value of the argument (it oscillates between -1 and +1). On the contrary, any non constant polynomial is always unbounded. $\sin(x)$ is a periodic function, but no (non constant) polynomial may ever be a periodic function. Other simple but interesting facts can be derived by thinking about Taylor approximations of functions that actually ARE polynomials. While the approximation shall be perfect if its order is greater than or equal to that of the approximated polynomial, it could be quite bad if truncated at a lower order. For example: let $f(x) = ax + bx^2$. The first order Taylor approximation in $x = 0$ is $ax$, so the error is the full $bx^2$, which could be huge even near $x = 0$ if $a$ is small compared to $b$. In our case, under some stationarity hypothesis on $\delta_{t+1}$, nothing really bad should happen if we use the approximation for a single point in time and when $\delta_{t+1}$ is near the approximation point (typically given by $\delta = E(\delta_{t+1})$). Even in this case, however, if the log dividend price ratio is not equal to $\delta$ we shall have a positive error, that is: the approximation shall always undervalue $r^*_{t+1}$. This is not a problem if we do not sum many of these errors.
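The polynomial example is easy to check numerically (the coefficient values below are arbitrary, picked only so that $a$ is small relative to $b$):

```python
a, b = 0.01, 10.0
f = lambda x: a * x + b * x ** 2
taylor1 = lambda x: a * x          # first order Taylor approximation at x = 0

x = 0.1
error = f(x) - taylor1(x)          # equals b*x^2 exactly, dwarfing the linear term
assert abs(error - b * x ** 2) < 1e-12
assert error > 50 * abs(taylor1(x))   # the "remainder" dominates the approximation
```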
If we take into account the remainder/error term and call it $e_{t+1}$ we get

$$R^*_{t+1} = k - \rho\delta_{t+1} + \delta_t + \Delta d_t + e_{t+1}$$

From this we have

$$\delta_t = \rho\delta_{t+1} - k - \Delta d_t + R^*_{t+1} - e_{t+1}$$

The same iteration as above yields

$$\delta_t = \rho^{m+1}\delta_{t+m+1} + \sum_{j=0}^{m}\rho^j(R^*_{t+j+1} - \Delta d_{t+j} - k - e_{t+j+1})$$

Hence, if we suppose, again, that the summation converges (in some sense) as $m$ goes to infinity, we have

$$\delta_t = \sum_{j=0}^{\infty}\rho^j(R^*_{t+j+1} - \Delta d_{t+j} - e_{t+j+1}) - \frac{k}{1-\rho}$$

Taking the expected value conditional on $t$ we now get

$$\delta_t = \sum_{j=0}^{\infty}\rho^j E_t(R^*_{t+j+1} - \Delta d_{t+j} - e_{t+j+1}) - \frac{k}{1-\rho}$$

We see that, if we take into account the error term, the expected value is always smaller than what we could get from the approximate formula. In the discounted expected dividend changes interpretation given above, this amounts to saying that the error acts by decreasing the risk premium from $R^*_{t+j+1}$ to $R^*_{t+j+1} - e_{t+j+1}$, and the sum of the error terms appearing on the right hand side shall be $\sum_{j=0}^{\infty}\rho^j e_{t+j+1}$. Under the stationarity hypothesis we made, the expected value of this sum can be bounded as:

$$\sum_{j=0}^{\infty}\rho^j E_t(e_{t+j+1}) \le \sum_{j=0}^{\infty}\rho^j\frac{Var(\delta_{t+1})}{8} = \frac{Var(\delta_{t+1})}{8(1-\rho)}$$

If the expected dividends are small with respect to the price, that is: if $\rho$ is near 1 (remember: $\rho = 1/(1+e^{\delta})$), the overall error could be quite big. This would imply that the above interpretation of the current log dividend price ratio as the expected value of future discounted differences between expected log returns and expected dividend log returns would severely overestimate the current dividend price ratio. It is also useful to remember that, since the observable $\delta_t$ is the sum of the approximation and the error, it may well be that the stochastic properties of $\delta_t$ are quite different from the properties of the approximation. In fact this analysis is still incomplete, as the error term is only formally introduced in a difference equation which is still believed to be linear.
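To get a feeling for the size of this bound, take a hypothetical variance for the log dividend price ratio (the value 0.04 is an assumption for the illustration, not an estimate) and let $\rho$ approach 1:

```python
def summed_error_bound(rho, var_delta):
    """Bound on the expected discounted sum of approximation errors:
    Var(delta) / (8 * (1 - rho))."""
    return var_delta / (8 * (1 - rho))

var_delta = 0.04   # assumed Var of the log dividend-price ratio (illustrative only)
bounds = [summed_error_bound(r, var_delta) for r in (0.90, 0.96, 0.99)]
print(bounds)      # the bound grows without limit as rho -> 1
assert bounds[0] < bounds[1] < bounds[2]
```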
What we did above was to study the implication of substituting the error for each observation in a linear difference equation which we held as correct. This is not the case. The error itself contains terms which depend on $\delta_{t+1}$. If we make this explicit we have:

$$\delta_t = \rho\delta_{t+1} - k - \Delta d_t + R^*_{t+1} - \frac{e^{\delta_L}}{(1+e^{\delta_L})^2}\frac{(\delta_{t+1}-\delta)^2}{2} = \eta_1\delta_{t+1} + \eta_2\delta_{t+1}^2 + \eta_{0,t}$$

where it should be noticed that $\delta_L$ itself depends on $\delta_{t+1}$. In any case, even supposing $\delta_L$ constant, we have a quadratic difference equation whose dynamic properties we should study. This could be quite a problem as this quadratic equation, depending on the values of the parameters, could have dynamics dangerously similar to those of the logistic equation and so completely different from those of a simple linear equation (the approximation). To understand this, try iterating the formula. As far as I know, in the (huge) literature using the Campbell and Shiller approximation, the quality of the approximation is usually studied in a pointwise way (observation by observation), where it works at least when the dividend price ratio has small variance (and now we know why), while the difference equation is applied over long (possibly infinite, as above) time horizons. A partial exception can be found in "The Log-Linear Return Approximation, Bubbles, and Predictability": Tom Engsted, Thomas Q. Pedersen and Carsten Tanggaard, JFQA (2012).

20.14 *Cochrane toy model

20.14.1 *Forecasting stock returns: to make a long story short

The problem of the forecastability of stock returns has (obviously) a long history inside and outside the academic Finance literature. Since markets have existed and prices of goods have been seen to change in time, humans have been interested in two related jobs: reducing the risk of losing due to price fluctuations (risk management), and forecasting such changes in order to invest in a good whose price would go up before this happened (speculative trading). About risk management.
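A tiny experiment (the coefficients are invented, only meant to illustrate the mechanism, not calibrated to any data) shows how a small quadratic term can leave the linear dynamics almost untouched near the fixed point, yet make the iterates explode, logistic-style, further away:

```python
eta1, eta2, steps = 0.9, 0.5, 8    # hypothetical linear and quadratic coefficients

def iterate(x0, quadratic=True):
    """Iterate x -> eta1*x (+ eta2*x^2 if quadratic) a fixed number of times."""
    x = x0
    for _ in range(steps):
        x = eta1 * x + (eta2 * x * x if quadratic else 0.0)
    return x

# near zero the two recursions behave almost identically...
assert abs(iterate(0.05, True) - iterate(0.05, False)) < 0.01
# ...but for a larger starting value the quadratic term takes over and the
# iterates explode, while the linear recursion just decays geometrically
assert iterate(1.0, False) < 0.5
assert iterate(1.0, True) > 1e6
```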
The Ecclesiastes 11.1-3 says: "Cast your bread on the surface of the waters, for you will find it after many days. Divide your portion to seven, or even to eight, for you do not know what misfortune may occur on the earth. If the clouds are full, they pour out rain upon the earth; and whether a tree falls toward the south or toward the north, wherever the tree falls, there it lies. He who watches the wind will not sow and he who looks at the clouds will not reap. Just as you do not know the path of the wind and how bones {are formed} in the womb of the pregnant woman, so you do not know the activity of God who makes all things". Which, according to experts, means in modern terms: "trade your goods because you'll get a reward from trading. Do not invest everything you have in just one project, because you do not know which project shall be successful. Some events are random and unforecastable; this notwithstanding, you must take decisions and act". About speculative trading: Thales of Miletus is said to have bought the rights to use oil presses before a harvest and resold these to olive producers at a huge profit. In the Ecclesiastes quotation, an interesting point is the connection set between the will of God and randomness. This is a very common finding in antiquity. Very roughly speaking, a notion common to many different and distant cultures was (and somewhat still is) this: God's will is made manifest in what we cannot forecast. For this reason any event we would term "random" had a "divine" content and, in fact, augurs (not to be confused with oracles) derived their wisdom from a detailed observation of these random events. In this sense gambling had a sacred content, as "luck" was nothing but a manifest sign of the favor of the gods. For this reason versions of objects we connect with gambling are commonly found on sacred grounds (astragali, throwing sticks, dice, fortune bones etc.).
There is a corollary to this which seems to have been (and for some still is) quite widespread: since God manifests itself in gambling, it would be sacrilege to try and "forecast" God's manifestation in a gamble. This is one of the reasons suggested to explain the mystery of the fact that mathematically advanced peoples, the like of the Egyptians, Sumerians, Greeks and Chinese, while spending much of their time and fortunes in gambling, did not develop even an elementary form of Probability Calculus, while possessing all the basic mathematical tools to do so. This "divine" attitude toward randomness had its critics. For instance Cicero, in "De divinatione", a dialogue we usually do not study in high school, where his consular exploits are preferred, discusses with remarkable clarity very modern ideas about "randomness" and "chance". Among many interesting points are, for instance: a definition of a random event as an event whose happening or not cannot be determined by what is known before the event, and the qualitative, if not quantitative, understanding that it is easier to get by chance a given result with two dice than to get by chance a beautiful painting by randomly throwing colors on a canvas (amazingly modern in this!). He also states a very important principle: since something is forecastable only if it depends on what happens before the event, and the study of the connection between "causes" and "effects" requires time and thought, "forecasting", when this is possible, should be left to experts, while haruspices dabble in trying to forecast what actually is random. A big innovation was made by the Christian and Islamic religions which, both in view of debasing competing creeds and as a one way recovery of part of the biblical teaching, practically forbade gambling, fortune telling and lending at an interest (and here the point is that these three things were considered connected).
In some extreme versions, any monetary saving, or even any saving that was not that of seeds for the next year, was considered blasphemous. Francis of Assisi's idea is even more extreme: we are part of nature and if we behave like the rest of nature we shall be provided for by God in the same way he provides for birds and flowers. So any saving is in some sense a lack of faith in divine providence. Something was saved of antiquity: God's will would not and could not be forecasted. The only way to be preserved from its wrath was prayer, not forecasting or risk management. However, God shall be fruitful for those who follow his will. So, similarly to ancient beliefs on gambling, if things go well for you this is a sign that God is with you. This is an interesting point which begins to be more and more relevant with the end of the middle ages. In full agreement with the Ecclesiastes, and somewhat recovering ancient ideas about gambling, merchants did really believe that to take a risk, not by gambling but by "venture", was a way to put one's virtues in front of God, and success, or defeat, was a proof of one's qualities not only as a merchant but as a good Christian (or Muslim). Today it is difficult to understand the difference perceived by our forebears between making money by trading goods in European fairs, speculating about the future price of a good (fine), buying it in excess if you believe in an increase of its price (suspect), or lending at interest to people willing to deal in these two activities (forbidden). However, many contemporary widespread attitudes w.r.t. banks, financial markets, mortgages etc. (in particular those common ideas which resurface during times of crisis) can be understood only in the light of these ancient stereotypes. Fast forward to US financial markets at the end of the nineteenth century. In the "stocks for the long run" section we discussed some properties of a famous dataset by Shiller (http://www.econ.yale.edu/~shiller/data.htm).
This dataset contains yearly and monthly data about stock and bond indexes, in real and nominal terms, from 1871 to the present day. Moreover, data is available on dividends and earnings. By itself, a detailed study of this dataset's properties, how it was compiled, which statistical problems are implied by the construction procedure, and so on, would require the time of this course and more. We are completely avoiding these important points and shall simply use this dataset in order to discuss some points related to stock return forecasting. Since our purpose is just to introduce Cochrane's paper in the next section, we shall be very quick in doing so. Some academic interest in financial markets has existed for a long time and with very advanced results. However, it is only during the fifties of the past century that what is today defined as the study of Finance (at least in an academic setting) moved its first steps. While most practitioners were (and are) interested in stating the "prospects" of possible investments, the first studies in Finance were dedicated to another objective. In the market we observe stocks whose returns show completely different properties in terms of the distribution of future returns (at least as these properties can be estimated using past data). So the initial question was different, and more interesting: "why are all these very different prospects of returns traded and priced every day in the market?" "Why is it not the case that only the 'best' survive and the rest disappear?" In other words the question, phrased in simple terms connected with just one statistical property, is: "How do we explain the cross section of expected stock returns?". Note: the question takes for granted that some estimate of expected returns (and more) is possible. It does not question such an estimate, and concentrates on how it is possible that stocks with different (but known) expected returns are all traded at the same time.
This is not the place to deal with this question and with the advancements which start with Sharpe and evolve over 30 years to Fama and French and beyond. As the reader knows, the basic idea for solving this puzzle is that different expected returns are reasonable in equilibrium if they are the result of different amounts of non diversifiable risk, paid at a rate that only depends on the specific risk. An increasing number of empirical studies were directed to prove or confute such theories and, at least beginning with the eighties, these studies began to have a noticeable influence on actual asset management procedures. Up to at least the first half of the seventies, most academic students of Finance shared the idea that, at least for stock prices, a simple log random walk model was all that was needed. Once you knew the expected value, variance and covariances of stock returns, that was it. Most of the studies concentrated on expected values; then, during the eighties, covariances and variances were also studied in detail and the Gaussian hypothesis was partially removed. However: no forecasting, just random walk. While some studies indicated the possibility of return predictability, this was considered more a puzzle than a feature of returns. Something to explain away, not something on which to build. In part, at least up to sometime in the middle of the seventies, this came from the idea that there existed some theorem of Probability which told you that "properly forecasted prices follow a random walk", so that forecastability was seen as an instance of the abhorred "free lunch" which would imply irrationality in Economics. This is not correct, but this is not the place to discuss such ideas. It is also to be admitted that the quality of the financial data available to academic researchers during this time was quite poor.
Moreover, such data mostly referred to the period between 1950 and 1980 which, in the light of what happened afterward, could be considered a very strange and uninteresting period. Things changed during the second half of the eighties. See, for a very relevant example: Fama, E. F., and K. R. French. 1988. Dividend Yields and Expected Stock Returns. Journal of Financial Economics 22:3–25. More and more high quality and detailed data became available. More powerful statistical tools were conceived, and low cost computers on which to run them became a commodity. What is more important, markets changed: the cold war ended, financial crises (even in the USA) were back, sovereign debts became gigantic and were happily traded. Students of stock prices began observing "anomalies" and ended up finding "properties". These anomalies/properties were, at first, included in standard cross sectional models by devising more and more risk factors which determined expected returns (see for example the series of Fama and French papers starting in the late eighties). More and more, these anomalies began to take the form of expected returns, or other moments of the return distribution, whose value changed conditional on the available information. In the meantime, the almost religious abhorrence of forecastability (see above for historical precedents) was at least partially tamed in two very different ways. The first was the understanding that conditional expected values (variances, covariances) changing in time did not mean "free lunches". The second was the more and more widespread idea that economy and finance are, after all, driven by true human behavior, and that any simple description of such behavior by some naive "rational" model of identical agents was perhaps too narrow a paradigm. Behavioral Finance was the result of such ideas.
The expository paper by Cochrane we are going to discuss is to be intended as a review of, and an attempt to systematize, a part of the efforts toward studying stock return predictability as an actual possibility. A quick word about data. In most papers, long run analyses of returns are done using some version of a total return, inflation corrected, U.S. stock index. The most frequently used datasets are:

• The already mentioned Shiller dataset from "Irrational Exuberance". You can find both a yearly and a monthly version of it here: http://www.econ.yale.edu/~shiller/data.htm

• The dataset used by Cochrane in his "Dog that did not Bark" paper; this, and the MATLAB programs for the computations in the paper, are in: http://faculty.chicagobooth.edu/john.cochrane/research/Data_and_Programs/The_Dog_That_Didnt_Bark/

• A dataset similar but not identical to Shiller's, containing monthly, quarterly and yearly data: http://www.hec.unil.ch/agoyal/docs/PredictorData2012.xls

We should go into the details of the datasets but we won't, here. These are all backward reconstructed index data. For this reason they are fraught with statistical problems connected with, but not limited to, the evolution of the index compositions. We do not consider these important points here. However, we suggest the student compare the datasets and try to find reasons for the non negligible differences. In particular, be very careful when comparing data on different time periods. For instance, in the monthly version of Shiller's data, dividends per month are not actually computed but interpolated from yearly dividends. For this reason monthly total returns shall be less variable than they should be (we shall briefly comment on this in what follows).
It is extremely relevant that anyone interested in the history of US stock market behavior be acquainted with the relevant historical data and with the problems intimately connected with the collection of historical data over relevant time spans, and this is a good place to start.

20.14.2 *A quick look at the data and some sketch of hypothesis

In order to show how this approximation is currently used in the analysis of stock returns, with all its pros and cons, we present here a critical analysis of a famous (and not too hard, from the technical p.o.v.) paper by J. H. Cochrane. In Cochrane, J. H. "The Dog That Did Not Bark: A Defense of Return Predictability." Review of Financial Studies, 21 (2008), the Author discusses some interesting points w.r.t. the forecastability of stock total returns, in particular on the basis of the dividend price ratio. Cochrane tries to put some structure on the problem, and a simple model for the dynamics of log returns (total returns), log dividend price ratio and log dividend growth is introduced. (To simplify notation from now on, and to be consistent with Cochrane's notation, in this section we indicate the total log return with $r$ instead of $R^*$.)

$$r_{t+1} = a_r + b_r\delta_t + \epsilon^r_{t+1}$$
$$\Delta d_{t+1} = a_d + b_d\delta_t + \epsilon^d_{t+1}$$
$$\delta_{t+1} = a_\delta + \phi\delta_t + \epsilon^\delta_{t+1}$$

This can be understood as a simple rule of thumb model for forecasting returns and dividends and, at first, it is used in an informal, exploratory way in order to assess our ability to forecast stock total returns ($r_{t+1}$). A direct naive estimate of this model based on yearly data (1926-2004) gives the following results:

yearly data   estimate    est-stdev   t-test P-value   R2
b_r           0.092039    0.051225    0.0763           0.04
b_d           0.007486    0.029968    0.8034           0.0008
phi           0.945744    0.043919    0.0000           0.8575

This kind of analysis uses data in real terms; we do not comment here on this use, which could be a good first approximation.
(The data we use are from Shiller's "Irrational exuberance" and the results are similar but not identical to what you get using the data by Cochrane. We use Shiller's data because a monthly version is available; recall the warning we just made about how monthly dividends are computed. Roughly speaking, Shiller reconstructs the S&P composite index while Cochrane uses the CRSP index. Results with other datasets are qualitatively similar but quantitatively different. In particular, the estimated standard deviation for $b_r$ may be quite different across different datasets. As usual, a long term effect may only be estimated with long range data and, after all, modern Finance is a quite young phenomenon.)

Before commenting on the results we repeat the same analysis with monthly data from the same source and time period:

monthly data   estimate     est-stdev   t-test P-value   R2
b_r            0.004088     0.003446    0.2358           0.001486
b_d            -0.003013    0.000890    0.0007           0.011968
phi            0.995886     0.003545    0.0000           0.9881

It is to be noticed that the monthly dividends are computed by interpolation in Shiller's data. Since what interests us are the main characteristics of the results, this is not a big problem. However, if we wish to go into details, this should be a point to discuss. A first simple reading of these exploratory results is as follows:

1. We can forecast a (small) part of the variance of future returns even using only the dividend price ratio. Forecastability seems to increase passing from monthly to yearly data.

2. The rate of change in dividends seems independent of the dividend price level.

3. The price dividend ratio is very persistent, that is: future price dividend ratios are easy to forecast on the basis of past price dividend ratios.

Forecastability, even at the yearly level, seems an almost negligible phenomenon but: do not be fooled by the 4% R2.
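For readers who want to replicate this kind of exploratory regression, here is a minimal sketch of how two of the predictive regressions can be run with ordinary least squares. Since we cannot ship the Shiller data here, the sketch simulates data from the model itself (all parameter values are invented, loosely inspired by the yearly estimates), so the point is only the mechanics, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
a_del, phi = -0.15, 0.95     # hypothetical AR(1) for the log dividend-price ratio
a_r, b_r = 0.10, 0.09        # hypothetical return equation coefficients

delta = np.empty(T)
delta[0] = a_del / (1 - phi)                 # start at the unconditional mean
for t in range(T - 1):
    delta[t + 1] = a_del + phi * delta[t] + 0.15 * rng.standard_normal()
r = a_r + b_r * delta[:-1] + 0.20 * rng.standard_normal(T - 1)

# OLS of r_{t+1} on delta_t and of delta_{t+1} on delta_t
X = np.column_stack([np.ones(T - 1), delta[:-1]])
beta_r, *_ = np.linalg.lstsq(X, r, rcond=None)
beta_del, *_ = np.linalg.lstsq(X, delta[1:], rcond=None)
print(beta_r[1], beta_del[1])   # close to the true b_r and phi
```

With real data one would, of course, also compute standard errors and the R2 for each regression, as in the tables above.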
An interesting possible consequence of 1+3 is that it should not be much more difficult to forecast single period returns far in the future than a single period return just one period ahead on the basis of price dividend ratios, since price dividend ratios are very persistent. In fact, high persistence of log dividend price ratios means these are easy to forecast over a rather long horizon. So we can apply our 4% R2 not just to the next return but to a rather long stretch of future returns. Now the magic: the fact that a part of the variance of returns can be forecasted using a very persistent series opens another interesting possibility. Since this happens, a part of the returns variance is due to a persistent (very much autocorrelated) component. However, if we consider returns over, say, a year, returns themselves are not persistent: a regression of current returns on lagged returns gives an estimated slope of 0.076306 with a standard error of 0.113613 and a p-value of 0.5038, corresponding to an R-square of 0.005. It is then possible to write each (say) monthly return as the sum of a (small variance) persistent (autocorrelated) component, due to the dependence on the price dividend ratio, and a (big variance) non autocorrelated component. Let's write this as

$$r_t = a_t + w_t$$

where we suppose that the correlation between the two components is 0. We know that log returns over many time periods (say $n$) are the sum of log returns over subperiods. This is true for simple log returns and also for log total returns with the reinvestment convention we used above.
In this case $r_{nt} = \sum_{t=1}^{n} r_t = \sum_{t=1}^{n} a_t + \sum_{t=1}^{n} w_t$, and the variance of a return over $n$ time periods shall be given by (supposing perfect autocorrelation between the persistent components and zero autocorrelation between the non persistent components)

$$V(r_{nt}) = V\left(\sum_{t=1}^{n} a_t\right) + V\left(\sum_{t=1}^{n} w_t\right) = n^2V(a) + nV(w)$$

Under these hypotheses, in fact (assuming the expected values of $\sum_{t=1}^{n} a_t$ and of $\sum_{t=1}^{n} w_t$ to be equal to 0), a regression of $r_{nt}$ on $\sum_{t=1}^{n} a_t = a_1 n$ shall give a beta of 1 and an

$$R^2 = \frac{n^2V(a)}{n^2V(a) + nV(w)} = \frac{nV(a)}{nV(a) + V(w)}$$

which tends to 1 as $n$ increases. For instance, deriving rough numbers from our data, if $V(a) = 1.5$ and $V(w) = 998.5$, while for 1 time period only 0.15% of the variance of returns is due to the forecastable $a$ component, for, say, 12 time periods the ratio becomes 216/(216 + 11982) = 1.77%. If we enlarge the time interval even more, say to 120, we get 21600/(21600 + 119820) = 15.27%. The longer the horizon, the better the forecast, because the variance of the forecastable part grows with the square of $n$ while the rest grows with $n$. Actually, things could be even better. If this is true, by forecasting price dividend ratios and using these to forecast one period returns, then summing these up, we should be able to forecast a sizable part of log total returns over longer time periods. The longer the length of the time period, the bigger the forecastable part. It is important to understand the exact meaning of the sentence "by forecasting price dividend ratios". If we are at time $t$ and wish to forecast the return over $n$ time periods, we have two possible solutions. The first is to do what Cochrane does and what we did a moment ago, that is: estimate a model that connects $r_{t+1}$ to $\delta_t$ and a model which connects $\delta_{t+1}$ with $\delta_t$.
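The back-of-the-envelope numbers above can be reproduced directly from the variance decomposition:

```python
def long_run_r2(n, var_a, var_w):
    """R^2 of regressing the n-period return on its persistent component:
    n^2 V(a) from the perfectly autocorrelated part, n V(w) from the noise."""
    return n ** 2 * var_a / (n ** 2 * var_a + n * var_w)

va, vw = 1.5, 998.5          # the rough numbers used in the text
assert abs(long_run_r2(1, va, vw) - 0.0015) < 1e-12
assert round(long_run_r2(12, va, vw), 4) == 0.0177
assert round(long_run_r2(120, va, vw), 4) == 0.1527
```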
Then, since $r_{t+m}$ is, with the proper assumption of dividend reinvesting, the sum of $m$ one period total returns, and each of these one period returns depends on the corresponding price dividend ratio, use the price dividend ratio model to forecast the $m$ intermediate price dividend ratios (in the Cochrane simple model each of these shall be a function of the same $\delta_t$), then use each of these forecasts to forecast the corresponding one period return, and get the estimate of the $n$ period return by summing these forecasts. The second possibility is to take return data at periodicity $m$, regress these on the corresponding one period price dividend ratios, then, at time $t$, forecast these price dividend ratios and use the forecasts in the model for $r_{t+m}$. It is clear that the second possibility is less useful than the first as, the greater $m$, the smaller the available sample and the worse the estimates. However, the first possibility heavily depends both on the approximation of $m$ period returns as sums of 1 period returns and on the single period models for $r_{t+1}$ and $\delta_{t+1}$. Let us now go back to our simple set of assumptions. Obviously this is a very quick sketch. For instance, in our data we see that it is actually easier to forecast yearly than monthly returns using dividend price ratios: the R2 goes from about 0.15% to 4%, not to 1.77%. This means that, at least passing from monthly to yearly returns, forecastability improves more than expected in our very simple sketch. Surely we are forgetting something important, probably many important things. A possible explanation is that data on dividends are quite sparse during the year and this could have an influence in degrading forecasts within the year which is not there between years. Moreover, the $w_t$ could actually be not uncorrelated but slightly negatively correlated. If this is true, the variance of their sum shall increase less than we expect. Moreover, variances could depend on time, and so on.
We can check our understanding of the data further by passing from yearly to, say, 10 year returns. In this case we should pass from a 4% R2 to something in the range of 29% (400/(400+960) = 0.294). The obvious problem is that we only have 9 data points in this case (we go up to 2006). This lack of data, by itself, could generate a positive bias in the estimate of the R2. However, what we get if we run the regression on decadal data is an R2 of 0.38 and, if we consider the estimate of R2 corrected for the degrees of freedom, we get 0.297, which is even too good to be true. Another possible problem is as follows. It has long been known that the variance of returns grows less than linearly. That is: the variance of returns over, say, 10 time periods is less than 10 times the variance of returns over 1 time period. The above sketched argument would imply, instead, that the variance of returns should increase faster than the number of periods over which returns are computed. In our dataset the yearly variance of returns is 0.03584, so we should observe a 10 year variance above 0.3584 (equal to this if we had no autocorrelated component). The actual 10 year return variance is only 0.1274. This may be due to the small sample size. However, the monthly return variance is 0.002094801, so we should expect something like 12 times this for one year, that is 0.02514, while we get 0.03583449, which is higher. So, the picture is not as clear cut as we would like it to be. It may well be that $w$ contains some component with a sizable negative autocorrelation over the long run and/or $a$ is not simply perfectly autocorrelated (we know it is not) but has a more complex autocorrelation structure, with a strong positive first order component but a negative second order component. It may also be that the interpolated nature of monthly dividends somewhat alters the picture.
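The arithmetic of these comparisons is easy to lay out explicitly (the variances are those quoted in the text; the zero-autocorrelation benchmark is only a reference point, not a claim about the data):

```python
var_year = 0.03584          # sample variance of yearly returns (from the text)
var_month = 0.002094801     # sample variance of monthly returns (from the text)

# benchmark with no autocorrelated component: variance scales linearly in n
implied_10y = 10 * var_year     # 0.3584, vs an observed 10-year variance of 0.1274
implied_1y = 12 * var_month     # about 0.02514, vs an observed 0.03583

# predicted decadal R^2 from the sketch: V(a) = 4, V(w) = 96 at the yearly horizon
r2_10y = (10 ** 2 * 4) / (10 ** 2 * 4 + 10 * 96)
print(implied_10y, implied_1y, r2_10y)
```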
Be that as it may, what is clear is that the study of return forecastability, simply ruled out by orthodox Finance journals up to the first half of the nineties, is back with a vengeance and much interest. The orthodox literature is now full of papers which try to explain this forecastability, and to find variables other than the dividend price ratio able to explain it. Consumption is a strong candidate, in view of standard consumption and investment models and considering its huge autocorrelation, but others enter the fray each month. One of your teachers favors variables connected with life expectancy, while another believes that some long run component in consumption does the trick. The writer of these handouts rests firmly among those who believe it to be very difficult to directly test for long run forecastability in financial data, simply because we do not have long run data. Finance as we know it is a rather new phenomenon and we can at most study long run implications of short run models, as Cochrane tries to do, but we cannot test forecastability hypotheses directly on non existent long run data. Be careful: we are dealing with long run forecastability. The philosopher's stone of a (legally admissible) machine for successfully forecasting tomorrow's stock prices on the basis of public information was and remains a chimaera. Moreover, the Reader shall not forget that we are trying to study long run properties as consequences of short run models. Finance as we intend it is still a relatively recent phenomenon, dating back no more than a couple of centuries, so there exists very little direct information on the long run properties of returns. The problem is that the study of long run properties of short run models is a study of the consequences of short run estimates, not a direct estimate of effects.
Moreover, the evaluation of such long run implications of short run models, and this shall be of paramount relevance for us, must typically be based on the iteration of a short run model (typically a stochastic difference equation) over many time periods. If the short run model is based on some approximation which, on each single period, has a maybe negligible error, but whose errors sum up to sizable totals over many periods (this should ring a bell!), problems may and do arise. In what follows we shall try to clarify how much the linearization of log total returns interacts with Cochrane's model and can help or hinder the study of the long run return forecastability problem.

20.14.3 *A more detailed look at Cochrane's model: one step ahead forecasts

In fact, this apparently simple model looks quite strange if you but think about it a little. Log total return r_{t+1} (recall the change in notation from R* to r) is a function of dividends and prices, the log dividend price ratio δ_{t+1} is a function of dividends and prices, and the log rate of increase of dividends is a function of dividends. It is reasonable to model somewhat independently two of these quantities, not all three of them. Three equations for only two underlying variables is a little too much, and we can see this, without any approximation, in what follows. Let us start from the definition of log total return

r_{t+1} = ln(1 + e^{δ_{t+1}}) + ln(P_{t+1}/P_t)

hence

r_{t+1} − Δd_{t+1} = ln(1 + e^{δ_{t+1}}) − δ_{t+1} + δ_t

and if we use in this identity the equations in the model we have

a_r + b_r δ_t + ε^r_{t+1} − a_d − b_d δ_t − ε^d_{t+1} = ln(1 + e^{a_δ + φδ_t + ε^δ_{t+1}}) − a_δ + (1 − φ)δ_t − ε^δ_{t+1}

ε^r_{t+1} − ε^d_{t+1} − ln(1 + e^{a_δ + φδ_t + ε^δ_{t+1}}) + ε^δ_{t+1} = a_d − a_r − a_δ + δ_t(1 − φ + b_d − b_r)

This is an identity, so it must always be true. Actually this is a very nasty stochastic nonlinear identity where, at time t, the rhs only contains non stochastic terms while the lhs contains functions of three stochastic variables (the epsilons).
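The exact dependence among the three errors can be seen numerically. In this sketch (parameter values purely hypothetical) we draw ε^d and ε^δ, build r_{t+1} from the exact definition, and verify that it coincides with the identity rearranged as an equation for r_{t+1} in terms of the other two errors:

```python
import numpy as np

# Numerical check with hypothetical parameter values: given the dividend-growth
# and delta equations, the exact definition of log total return pins down
# r_{t+1}, so the "third" error is a function of the other two.
rng = np.random.default_rng(1)
a_d, b_d = 0.01, 0.0          # dividend growth equation
a_delta, phi = -0.03, 0.94    # log dividend price ratio equation

delta_t = rng.normal(-3.5, 0.4, 1000)
eps_d = rng.normal(0, 0.1, 1000)
eps_delta = rng.normal(0, 0.15, 1000)

delta_next = a_delta + phi * delta_t + eps_delta
dd_next = a_d + b_d * delta_t + eps_d

# exact definition: r = ln(1 + e^{delta_{t+1}}) - delta_{t+1} + delta_t + dd
r_exact = np.log1p(np.exp(delta_next)) - delta_next + delta_t + dd_next

# the nonlinear "first equation" implied by the other two
r_implied = (a_d - a_delta + (1 - phi + b_d) * delta_t
             + np.log1p(np.exp(delta_next)) + eps_d - eps_delta)

print(np.max(np.abs(r_exact - r_implied)))  # ~0 up to rounding
```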
Hence, as we could expect, since we have three random variables functions of which must satisfy an identity (the definition of log total return), there must be some nonlinear but exact, that is: functional, dependence between the three errors in the model, which can hold only under very peculiar hypotheses on the error terms (for instance, as we can expect, two of them imply the third). This dependence is what is described in the above formula: if we observe δ_t, given any two of the errors we can solve for the third (pick your choice). It is clear that this identity creates problems in the "simple" model by Cochrane, as it implies that the epsilons depend on the value of δ_t, ruling out the interpretation of the Cochrane model as a regression model (there exists dependence between the errors and the regressors). Since we have a functional dependence among the epsilons (and δ_t), all their stochastic properties shall be related. This implies that we cannot simply estimate the above model with, say, a VAR estimate without imposing the exact nonlinear restriction implied by the definition of log total return. A good way to express this identity in the Cochrane model would be to change the first of the three equations in the following (very nonlinear) way

r_{t+1} = a_d − a_δ + (1 − φ + b_d)δ_t + ln(1 + e^{a_δ + φδ_t + ε^δ_{t+1}}) + ε^d_{t+1} − ε^δ_{t+1}

Let us now pass to our approximation. Nothing changes except the fact that we shall use an approximate definition of log total return.
r_{t+1} = k − ρδ_{t+1} + δ_t + Δd_{t+1}

and we write this, coherently with what we just did, as

r_{t+1} − Δd_{t+1} = k − ρδ_{t+1} + δ_t

to be compared with the exact

r_{t+1} − Δd_{t+1} = ln(1 + e^{δ_{t+1}}) − δ_{t+1} + δ_t

Now everything is linear (we are forgetting the approximation error) and we get

a_r + b_r δ_t + ε^r_{t+1} − a_d − b_d δ_t − ε^d_{t+1} = k − ρa_δ + (1 − ρφ)δ_t − ρε^δ_{t+1}

Again we have a (now linear) exact functional dependence between the errors

ε^r_{t+1} − ε^d_{t+1} + ρε^δ_{t+1} = k − ρa_δ + (1 − ρφ)δ_t + a_d + b_d δ_t − a_r − b_r δ_t

Proceeding as before we could substitute this identity in the first equation of the model, getting

r_{t+1} = k + a_d − ρa_δ + (1 − ρφ + b_d)δ_t + ε^d_{t+1} − ρε^δ_{t+1}

This must be compared with the first equation in Cochrane's model

r_{t+1} = a_r + b_r δ_t + ε^r_{t+1}

So we have two versions of the equation in Cochrane's model. The first is implied by the other two equations and the assumption that the approximation be exact. The second is the original version. If the approximation AND the model both work, we should have that, whatever be δ_t, the results of the two equations be the same. This imposes constraints on the coefficients and the error terms of the two equations. Let us now proceed as Cochrane does. Suppose that the expected values of the epsilons are zero (conditional on δ_t). If we take the difference of the expected values of the two versions of the same equation and require this difference to be 0 we have

0 = k − ρa_δ + (1 − ρφ)δ_t + a_d + b_d δ_t − a_r − b_r δ_t

Since this must be true whatever δ_t be, we need both the sum of the constant terms and the sum of the slope terms to be equal to 0, that is

0 = k − ρa_δ + a_d − a_r
0 = 1 − ρφ + b_d − b_r

The second constraint is studied in the quoted paper by Cochrane as a necessary condition for the approximation and the three equation model to be consistent. A more correct (if less "nice") reasoning is as follows (there is a hint of this in Cochrane's paper).
Suppose you want both the three equation model and the approximation to work, whatever be the value of δ_t, with epsilons independent of δ_t. In this case you require

0 = 1 − ρφ + b_d − b_r

(for independence of δ_t), and

ε^r_{t+1} − ε^d_{t+1} + ρε^δ_{t+1} = k − ρa_δ + a_d − a_r

so that the errors satisfy the (approximate) linear constraint. We clearly see that the simple test of H₀: 0 = 1 − ρφ + b_d − b_r by itself is not sufficient as a test for the joint hypothesis of validity of both the approximation and the three equations model (with independence of errors and δ_t). What Cochrane actually does is checking not whether the approximation works with the model but whether the approximation works on average with the model. A good test for this hypothesis would be given by the following procedure. First, estimate the three equation model with the added constraint (we put back the constants which in Cochrane's paper are, for some reason, omitted)

ε^r_{t+1} − ε^d_{t+1} + ρε^δ_{t+1} = k − ρa_δ + a_d − a_r

or, which is the same, estimate the three equations model after substituting the first equation with

r_{t+1} = k + a_d − ρa_δ + (1 − ρφ + b_d)δ_t + ε^d_{t+1} − ρε^δ_{t+1}

Second, check the condition

0 = 1 − ρφ + b_d − b_r

using the estimated values of the parameters. Remember that this second condition fully comes from the idea that the ε-s should not depend on δ_t. However, we still have a problem: when we estimate the three equations (constrained) model we use "true" log total returns: r_{t+1}. The constrained first equation

r_{t+1} = k + a_d − ρa_δ + (1 − ρφ + b_d)δ_t + ε^d_{t+1} − ρε^δ_{t+1}

only truly applies to the approximate returns.
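A small simulation can show what "only truly applies to the approximate returns" means in practice. A sketch, with all parameter values hypothetical: simulate the δ and dividend growth equations, build EXACT log total returns from the definition, then compare them with the fitted values of the approximate constrained equation, with k and ρ computed from the Campbell-Shiller expansion point.

```python
import numpy as np

# Hypothetical parameters; the point is the comparison, not the numbers.
rng = np.random.default_rng(2)
T = 5000
a_delta, phi = -0.21, 0.94            # implies E(delta) = -3.5
a_d, b_d = 0.01, 0.0

delta = np.empty(T + 1)
delta[0] = a_delta / (1 - phi)
eps_delta = rng.normal(0, 0.12, T)
for t in range(T):
    delta[t + 1] = a_delta + phi * delta[t] + eps_delta[t]
eps_d = rng.normal(0, 0.05, T)
dd = a_d + b_d * delta[:-1] + eps_d   # dividend growth

# exact log total return from the definition
r = np.log1p(np.exp(delta[1:])) - delta[1:] + delta[:-1] + dd

# Campbell-Shiller linearization point: rho = 1/(1+e^dbar), k from the expansion
dbar = delta.mean()
rho = 1 / (1 + np.exp(dbar))
k = np.log1p(np.exp(dbar)) - dbar * np.exp(dbar) / (1 + np.exp(dbar))

# fitted values of the approximate constrained equation
r_approx = (k + a_d - rho * a_delta + (1 - rho * phi + b_d) * delta[:-1]
            + eps_d - rho * eps_delta)
print(np.corrcoef(r, r_approx)[0, 1])  # close to 1 when the approximation works
```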
As we saw above, true log total returns should be used not in this equation but in the nonlinear equation

r_{t+1} = a_d − a_δ + (1 − φ + b_d)δ_t + ln(1 + e^{a_δ + φδ_t + ε^δ_{t+1}}) + ε^d_{t+1} − ε^δ_{t+1}

For this reason we must conclude that the only possible "test" for the approximation and the three equation model comes from estimating the nonlinear three equation model, computing ρ and k according to their definitions, substituting all parameter values and epsilon values so estimated in the approximate equation for r_{t+1}

r_{t+1} = k + a_d − ρa_δ + (1 − ρφ + b_d)δ_t + ε^d_{t+1} − ρε^δ_{t+1}

and seeing whether the fitted values obtained by this equation are similar to the true r_{t+1} series. This may seem complex, as the nonlinear first equation is difficult to estimate, but we must understand that, since the constrained first equation does not contain any parameter or residual that does not come from the other two equations, we really do not need to estimate it (in technical terms it is not a constrained but a redundant equation), so that the procedure is very simple. Estimate the two remaining equations, compute k and ρ by expanding the term ln(1 + e^{a_δ + φδ_t + ε^δ_{t+1}}) (typically at the point δ = δ̄_t or similar) and do as previously advised. An even simpler procedure starts from the same estimates, leaves k and ρ unspecified in

r_{t+1} = k + a_d − ρa_δ + (1 − ρφ + b_d)δ_t + ε^d_{t+1} − ρε^δ_{t+1}

and then regresses r_{t+1} − a_d + a_δ − (1 − φ + b_d)δ_t − ε^d_{t+1} + ε^δ_{t+1} (which, by the identity, is just ln(1 + e^{δ_{t+1}})) on a constant and δ_{t+1} = a_δ + φδ_t + ε^δ_{t+1}, looking at the R square of this regression, which should be very high (in principle equal to 1) in order to justify the Taylor approximation in the context of Cochrane's model. Cochrane's model is intended for a simple study of the forecastability of log total returns. In a naive use of the model this is a question whose answer lies in the estimation of b_r. In Cochrane's approximate, constrained version this has to do with the value of 1 − ρφ + b_d. In the full version of the model the forecasting properties are more interesting.
The ability of the log dividend price ratio to forecast returns has a linear component, which depends on the value of 1 − φ + b_d, and a nonlinear part, which depends on ln(1 + e^{a_δ + φδ_t + ε^δ_{t+1}}). The difference between the true and the approximate parameter for δ_t, that is: ρφ − φ, comes fully from the linear approximation of ln(1 + e^{a_δ + φδ_t}). Notice the implicit hypothesis of expanding δ_{t+1} around its expected (conditional on δ_t) value a_δ + φδ_t. Moreover, notice that ρ shall be a function of φ. For these reasons, supposing φ positive and smaller than 1 and b_d positive, while in the approximate model the effect of an increase of δ_t on the expected value of r_{t+1} is positive and the same for any level of δ_t, in the full model this effect shall be smaller the smaller is δ_t and bigger the bigger is δ_t. For positive values of φ we shall have an effect that goes, roughly, from 1 − φ + b_d when dividends are near zero to 1 − φ/2 + b_d in the limit case where price and dividends are the same. For this reason, in the approximate version, ρ is expected to be smaller than, but near, 1.

20.14.4 *A more detailed look at Cochrane's model: iterating over many time periods

In section 2.2 of his paper Cochrane computes "long run" coefficients. Actually we have two kinds of long run coefficients; Cochrane only considers one but here we shall deal with both. Let us do it step by step. We begin by rephrasing Cochrane's argument (we hope this version shall be clearer than the original).
The starting point is the iteration of the linearized log total return (here we continue to use r for the log total return)

δ_t = Σ_{j=0}^∞ ρ^j (r_{t+j+1} − Δd_{t+j+1}) − k/(1 − ρ) = Σ_{j=0}^∞ ρ^j r_{t+j+1} − Σ_{j=0}^∞ ρ^j Δd_{t+j+1} − k/(1 − ρ)

Now multiply on the left and on the right by δ_t − E(δ_t) and take the expected value (and remember that the expected value of the difference from the expected value is 0)

V(δ_t) = Cov(Σ_{j=0}^∞ ρ^j r_{t+j+1}; δ_t) − Cov(Σ_{j=0}^∞ ρ^j Δd_{t+j+1}; δ_t) = Σ_{j=0}^∞ ρ^j Cov(r_{t+j+1}; δ_t) − Σ_{j=0}^∞ ρ^j Cov(Δd_{t+j+1}; δ_t)

where the second equality simply depends on the exchangeability of the expected value operator and of the sum operator (really this is not a sum, it is a series, and we are supposing something more than this, but it is not the case here to be too picky). This is obviously an identity if the required moments exist and if we forget the approximation error. An identity is always true so, please, do not try to give it any economic interpretation. If we divide everything by V(δ_t) we get

1 = β(Σ_{j=0}^∞ ρ^j r_{t+j+1}; δ_t) − β(Σ_{j=0}^∞ ρ^j Δd_{t+j+1}; δ_t) = Σ_{j=0}^∞ ρ^j β(r_{t+j+1}; δ_t) − Σ_{j=0}^∞ ρ^j β(Δd_{t+j+1}; δ_t)

where β(a; b) = Cov(a; b)/V(b) is the univariate regression coefficient of a on b. It is interesting to notice that this equation tells us (but this is again a tautology, if everything converges in the right sense) that the regression coefficient we get by regressing the sum of future (discounted with ρ) total log returns on the current log dividend price ratio is identical to the sum of the (discounted) regression coefficients of each future return on the current log dividend price ratio.
If we now apply the three equations of Cochrane's model, we are easily able to compute these betas as, by direct substitution, we have

β(r_{t+j+1}; δ_t) = b_r φ^j
β(Δd_{t+j+1}; δ_t) = b_d φ^j

It is then easy to compute

Σ_{j=0}^∞ ρ^j β(r_{t+j+1}; δ_t) = Σ_{j=0}^∞ ρ^j b_r φ^j = b_r/(1 − ρφ)

and a similar formula for the other parameter, which shall be less relevant because we know that the estimate of b_d is very near 0. This is what Cochrane calls the long run coefficient. The reason for this name is that it is the value (given the model and the approximation) of the regression coefficient of the sum of future (discounted) returns on the current log dividend price ratio. By repeating the arguments in the previous section we could check the quality of the model and the approximation by verifying whether the constraint implied by the approximation actually holds but, since no direct estimate of the long run coefficient is possible, this would tell us nothing of relevance, except that an error in respecting the constraint would be amplified by the division by the term 1 − ρφ, which is going to be quite small. There is a second "long run coefficient" of interest for us, which is more connected than the former with long run forecastability, and it is going to add an interpretation to Cochrane's result. Let us start from the approximated model (the constrained first equation)

r_{t+1} = k + a_d − ρa_δ + (1 − ρφ + b_d)δ_t + ε^d_{t+1} − ρε^δ_{t+1}

Suppose we want to forecast the log total return (with reinvested dividends) over n+1 time periods

Σ_{j=0}^n r_{t+j+1} = (n+1)(k + a_d − ρa_δ) + (1 − ρφ + b_d) Σ_{j=0}^n δ_{t+j} + Σ_{j=0}^n (ε^d_{t+j+1} − ρε^δ_{t+j+1})

which, according to the simple model for δ_t, is equal to

Σ_{j=0}^n r_{t+j+1} = (n+1)(k + a_d − ρa_δ) + (1 − ρφ + b_d)[(Σ_{j=0}^n φ^j)δ_t + terms in the constants and in the ε^δ's] + Σ_{j=0}^n (ε^d_{t+j+1} − ρε^δ_{t+j+1})

As before, multiply on both sides by δ_t − E(δ_t) and divide by the variance of δ_t.
We get

β(Σ_{j=0}^n r_{t+j+1}; δ_t) = (1 − ρφ + b_d) Σ_{j=0}^n φ^j = (1 − ρφ + b_d)(1 − φ^{n+1})/(1 − φ) → (1 − ρφ + b_d)/(1 − φ) as n → ∞

(the limit holds if the absolute value of φ is less than 1). If we wish to compute the R² of the regression, we need the variance of the total return from t to t+n+1 conditional on t−n−1. We have, supposing stationarity and no correlation across the errors,

V(Σ_{j=0}^n r_{t+j+1} | t−n−1) = (1 − ρφ + b_d)² V(δ_t | t−n−1)(Σ_{j=0}^n φ^j)² + Σ_{j=0}^n (φ^j − ρ)² V(ε^δ) + (n+1)V(ε^d)

Recall that, if we regress the total return over n+1 time periods on the log dividend price ratio at the beginning of each period of length n+1, we just use one log dividend price ratio every n+1 data points. So the relevant variance of the log dividend price ratio for computing the R² of the regression is the variance of this "interval sampled" series and, given the simple autoregressive model for the log dividend price ratio, this variance shall be bigger than the variance of the same series sampled at each data point. In fact

V(δ_t | t−n−1) = V(Σ_{j=0}^n φ^j ε^δ_{t−j}) = V(ε^δ) Σ_{j=0}^n φ^{2j}

Some comments. If the model plus the Campbell-Shiller approximation work, we should have (see above) (1 − ρφ + b_d) ≈ b_r and, for large n, we should then have (supposing, obviously, φ between 0 and 1 and strictly less than 1)

β(Σ_{j=0}^n r_{t+j+1}; δ_t) ≈ b_r/(1 − φ)

That is, exactly what we find in Cochrane's analysis but without the ρ. This result states that the longer the time period over which we compute a return, the bigger is going to be the regression coefficient of this return on the log dividend price ratio. This confirms our "back of the envelope" analysis made under the hypothesis that log dividend price ratios be perfectly autocorrelated, but it adds an important insight: due to the non perfect autocorrelation (φ is less than 1), the growth of the regression coefficient shall not be linear and unbounded but less than linear and bounded.
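The bounded, less than linear growth of the horizon-n coefficient is easy to tabulate. A sketch with hypothetical parameter values:

```python
# Hypothetical parameter values: the n-period coefficient
# (1 - rho*phi + b_d)(1 - phi^(n+1))/(1 - phi) grows with n
# but is bounded by (1 - rho*phi + b_d)/(1 - phi).
rho, phi, b_d = 0.96, 0.94, 0.0

def beta_n(n):
    return (1 - rho * phi + b_d) * (1 - phi ** (n + 1)) / (1 - phi)

print([round(beta_n(n), 3) for n in (0, 4, 9, 49)])   # increasing in n
print((1 - rho * phi + b_d) / (1 - phi))              # the bound
```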
With our simple OLS monthly data estimates we get .0041(1 − .996^{13})/(1 − .996) = .052 which, if we consider the size of the standard errors involved (and the rough treatment of monthly dividends), is not so far from the .092 estimated with yearly data. With these estimates the limiting value for β shall be equal to .0041/(1 − .996) = 1.025. From the above formula for the variance of cumulated log total returns we have

1 − R² = [Σ_{j=0}^n (φ^j − ρ)² V(ε^δ) + (n+1)V(ε^d)] / [(1 − ρφ + b_d)² V(ε^δ)(Σ_{j=0}^n φ^{2j})(Σ_{j=0}^n φ^j)² + Σ_{j=0}^n (φ^j − ρ)² V(ε^δ) + (n+1)V(ε^d)]

Since everything is bounded, except the terms growing like n, which appear identically in the numerator and in the denominator, this goes to 1 for n going to infinity, except in the case φ = 1. In this case the formula simplifies to

1 − R² = [(n+1)(1 − ρ)² V(ε^δ) + (n+1)V(ε^d)] / [(1 − ρφ + b_d)² V(ε^δ)(n+1)³ + (n+1)(1 − ρ)² V(ε^δ) + (n+1)V(ε^d)]

and clearly this goes to 0 as 1/n² when n goes to infinity. In intermediate cases (from our data we see that φ is very near to 1) we have that, starting from the value

1 − R² = [(1 − ρ)² V(ε^δ) + V(ε^d)] / [(1 − ρφ + b_d)² V(ε^δ) + (1 − ρ)² V(ε^δ) + V(ε^d)]

the value of 1 − R² shall first decrease and then increase, with a limit of 1. Summing up: if the approximation and the simple model work, log dividend price ratios shall have increasing forecast power on log total returns when these returns are computed over longer time periods, up to some time horizon depending on the parameters of the model. Notice that the smaller the variance of ε^d (which, according to our empirical results, shall be almost the same as the variance of Δd_t), the longer the time interval over which the forecasting power of the regression shall increase. In the limit, if this variance is 0, that is, if the dividend flow is constant, 1 − R² decreases to a lower bound which is smaller the smaller is V(ε^δ). This is not surprising.
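The hump-shaped behaviour of the forecasting power can be traced numerically from the 1 − R² formula above. A sketch, with hypothetical variances and parameters and φ near 1:

```python
import numpy as np

# Hypothetical values: v_delta = V(eps_delta), v_d = V(eps_d), phi near 1.
rho, phi, b_d = 0.96, 0.994, 0.0
v_delta, v_d = 0.002, 0.002

def one_minus_r2(n):
    j = np.arange(n + 1)
    expl = ((1 - rho * phi + b_d) ** 2 * v_delta
            * (phi ** (2 * j)).sum() * (phi ** j).sum() ** 2)
    noise = ((phi ** j - rho) ** 2).sum() * v_delta + (n + 1) * v_d
    return noise / (expl + noise)

# 1 - R^2 first falls with the horizon, then rises back toward 1
curve = [one_minus_r2(n) for n in (0, 12, 120, 1200, 120_000)]
print([round(c, 3) for c in curve])
```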
In fact, if we start from the approximation

r_{t+1} = k − ρδ_{t+1} + δ_t + Δd_{t+1}

suppose Δd constant (to be simple, equal to 0, but nothing relevant changes) and use the equation for the evolution of δ_t, we get

r_{t+1} = k − ρ(a_δ + φδ_t + ε^δ_{t+1}) + δ_t = k − ρa_δ + (1 − ρφ)δ_t − ρε^δ_{t+1}

and the better you can forecast δ, that is, the smaller V(ε^δ), the better the forecast, no disturbance being induced by random changes in dividends. More simply: if dividends are constant, forecasting log dividend price ratios is equivalent to forecasting minus log prices.

21 *Appendix: Some further info about the use of regression models

The following discussion is a simplified version of some sections of Haavelmo's 1943 paper (Trygve Haavelmo: "The Statistical Implications of a System of Simultaneous Equations", Econometrica, 11, 1, Jan. 1943, pp. 1-12). In Economics, as a rule, many variables interact in order to maintain a system in equilibrium; a typical setting is the supply/demand interpretation of market equilibrium. Let us begin by considering a consumer who confronts a given price P for a good. For any given fixed price P we could assume that the quantity Q that the consumer decides to buy is given by, say,

Q = α + βP + e₁

where we may suppose e₁ to have expected value 0. We may understand this as our model for the answer of the consumer to a set of questions about the quantity the consumer would be willing to buy at a given unit price, where the unit price is not necessarily a market price but is a price set by us who ask the question. In this setting (where P is not stochastic) it is easy to see that, for any given P, we have E(Q|P) = α + βP. On the other hand, let us see the thing from the point of view of the "seller" (supply). Suppose the seller is asked for a GIVEN quantity Q; we may suppose the seller shall require a price equal to

P = γ + δQ + e₂

Again let us suppose E(e₂) = 0. In this case we may say that E(P|Q) = γ + δQ.
Now, both these results are exact under the hypothesis that, in the first, P is given and, in the second, Q is given. In a sense we are implicitly considering two experiments: in the first the consumer is confronted with different given levels of price and asked for the desired quantity at that price; in the second the seller is confronted with given quantities and asked for the price. The two random elements (e₁, e₂) take into account the fact that the answers for the same fixed P or Q shall not always be the same. This could be, and we suppose it to be, a good description for the specific experimental setting. Suppose now we let the market work: we do not set price and ask for quantities or vice versa, we just observe prices and quantities as set by the market. Let us suppose that, notwithstanding the fact that we are no longer asking questions but are now in an observational setting, equilibrium requires Q = α + βP + e₁ and P = γ + δQ + e₂ to hold simultaneously. The algebraic meaning of this is that both the Q and the P in the supply and in the demand function must have the same value (this was absolutely NOT required when the consumer and the seller answered our questions). Moreover, let us suppose that both consumer and supplier do not alter their demand and supply functions due to the fact that they are now in a true market situation and not simply answering questions. This, as stated in the text, is a very strong hypothesis. Under these conditions we have a system of two simultaneous equations and we can express the equilibrium values of both P and Q as functions of e₁, e₂. We have

P = (γ + δα + δe₁ + e₂)/(1 − βδ) and Q = (α + βγ + e₁ + βe₂)/(1 − βδ)

Both P and Q depend (in different ways) on the same e₁ and e₂. Now: what shall be, say, E(P|Q) in this setting? Suppose, for simplicity, that e₁ and e₂ be jointly Gaussian with expected values both 0, covariance 0, and variances V(e₁) and V(e₂).
In this case, with some easy computations, we find that:

E(P|Q) = (γ + δα)/(1 − βδ) − B_{P|Q}(α + βγ)/(1 − βδ) + B_{P|Q} Q

where:

B_{P|Q} = (δV(e₁) + βV(e₂))/(V(e₁) + β²V(e₂))

This is NOT γ + δQ (except in particular cases). In fact, it would be better to use different symbols for the conditional expectation describing the answers of the seller to our questions and this conditional expectation, which has to do with the information we can get on the EQUILIBRIUM P given the EQUILIBRIUM Q (we are going to do this in a moment). In other words: even if the underlying hypotheses on demand and supply do not change, if we pass from the "experimental" setting to the "observational" setting (and, obviously, vice versa), the relevant regression functions are different. If I want to forecast P when I confront the seller with a GIVEN Q ("experiment"), and let us indicate this with Q*, the forecast shall be E(P|Q*) = γ + δQ*. However, if, confronted with market data, I want to forecast the price P for a quantity Q in equilibrium ("observational"), I shall use E(P|Q) = (γ + δα)/(1 − βδ) − B_{P|Q}(α + βγ)/(1 − βδ) + B_{P|Q} Q. What is happening is this: when I confront, say, the seller and ask for the unit price the seller would require for a given quantity, I am not requiring that the resulting price quantity pair be acceptable to the buyer. I am simply studying the supply, that is: what would be the price asked for a given fixed quantity. This is NOT what we get from observations of price quantity pairs in the market. In this case only those pairs which are in equilibrium are observable. If I observe a Q I am not "choosing" it, I observe it because it is an equilibrium Q, and to this shall correspond an equilibrium P. The same Q, if chosen by me or if accepted as an equilibrium value, has different meanings and yields different information about the corresponding P, even if we suppose that nothing changes in the parametrization of demand and supply.
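A Monte Carlo sketch makes the point concrete (all parameter values hypothetical): regressing equilibrium P on equilibrium Q recovers B_{P|Q}, not the supply slope δ.

```python
import numpy as np

# Hypothetical structural parameters.
rng = np.random.default_rng(3)
alpha, beta = 10.0, -1.0   # demand: Q = alpha + beta*P + e1
gamma, delta = 1.0, 0.5    # supply: P = gamma + delta*Q + e2
v1, v2 = 1.0, 2.0
n = 1_000_000
e1 = rng.normal(0, np.sqrt(v1), n)
e2 = rng.normal(0, np.sqrt(v2), n)

den = 1 - beta * delta
P = (gamma + delta * alpha + delta * e1 + e2) / den   # equilibrium price
Q = (alpha + beta * gamma + e1 + beta * e2) / den     # equilibrium quantity

b_hat = np.cov(P, Q)[0, 1] / Q.var()
b_theory = (delta * v1 + beta * v2) / (v1 + beta ** 2 * v2)
print(b_hat, b_theory, delta)  # b_hat ~ b_theory, both far from delta
```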
This is basic Economics and is a VERY SIMPLE example; in more realistic situations many more variables are considered in the model. It should be useful to understand the kind of analysis which is necessary in these cases. Notice, again, that here we did suppose that, in a sense, being in an experimental setting or in an observational setting "did not change" the "form of the model". As noted above, this is by no means the usual case. An aside: B_{P|Q} is undefined if both e₁ and e₂ have zero variance. This is because in this case the equilibrium is just one point: a specific P and a specific Q. To be precise: P = (γ + δα)/(1 − βδ) and Q = (α + βγ)/(1 − βδ) (we suppose the four parameters to be such that the solution exists and is unique, in particular βδ ≠ 1). Which regression function should we use? If what we wish for is a forecast of what we can expect P to be, given the Q observed in the market, we should use the equilibrium regression function. Suppose, however, that our purpose is to compel sellers to produce a given quantity Q*: in order to assess which price we should pay for this out of equilibrium Q*, we should use E(P|Q*) = γ + δQ*. It may be interesting to notice that, if this is our purpose, and sellers know this, they could be induced to answer our questions about the price for a given quantity in a way that does not reflect what they would do when confronted with market equilibrium. They could be induced to cheat and set, for instance, a bigger γ, just because this would imply a bigger price for any Q*. In other words: when "experiments" are run on "strategic" individuals, these could be uninformative about the individual behaviour outside the experiment. This is exactly what we supposed NOT to happen when we assumed the demand and supply functions not to change when elicited by our questions and when considered in the market equilibrium.
If this invariance does not hold, observational data on equilibrium prices and quantities can be used for forecasts but not for policy evaluation. On the other hand, direct "experimental" data on supply and demand can be used for intervention evaluation but not for forecasting purposes. This problem, whose likelihood is very high if we assume strategic agents, could be solved if we could model the way in which the behaviour of agents changes according to the different setups (un-tampered equilibrium, intervention, experimental observation). The debate about these fundamental points on the use of Econometrics has been central in the Econometrics literature of the last 80 years (and, as such, it has contributed a lot to making the field interesting). Here are some quick references. Haavelmo (1943) explicitly contains these considerations. A special case was considered by Robert Lucas (1976)⁷³. The specific case considered by Lucas is that of the different effect on agents' expectations of data coming from "natural" market action and of data altered by policy actions. Lucas's analysis is usually discussed under the term "Lucas critique". Even if a given economic model correctly describes the joint probability distributions of economic observables in an economic system when intervention is not active, any policy action could essentially change the system due to the (assumed optimizing) reaction to the intervention by the economic agents. The basic idea of Lucas, echoing Haavelmo and, more simply, common sense, is that we should explicitly model the reaction of agents to policy actions and how this alters the overall equilibrium behaviour. If this is not done, and such reaction alters equilibrium behaviour, the usefulness of historical data for evaluating policy action is in doubt, and forecasts of "effects" of policy action could be way off the mark.
The analysis also implies, obviously, that, conversely, data observed when a policy was in action would possibly be of doubtful use in estimating forecast models to be used when the policy is not acting. Clearly, modeling the reaction of agents to policy action may be feasible at an individual agent micro level. At a macro level, that is: at the level of the full economic system, this is very difficult (and this is a euphemism) and possible only conditional on drastic simplification hypotheses (e.g. the "representative agent") which may degrade the realism of the model. Chris Sims (1980)⁷⁴, in one of the most quoted papers in the history of Econometrics, strongly makes the point that, due to the problems and potentially arbitrary choices involved in the attempt at a causal interpretation of econometric models, more care should be given to building models whose first purpose is forecasting. He also argues that such models, under reasonable assumptions, could be used for at least approximate policy evaluations. Sims elaborates on previous research, in particular on a paper by Ta-Chung Liu (1960)⁷⁵ and on the stream of research induced by Haavelmo's paper. In recent years, many econometricians, in particular micro econometricians (see above for the comment on the Lucas critique), have based their interpretation of the "causal" use of (micro) econometric models on the approach summarized, for instance, by the work of Judea Pearl⁷⁶. This, in its turn, is based on the approach to causal inference developed by Donald Rubin and his co-authors since the '70s⁷⁷.

73. Lucas, Robert (1976). "Econometric Policy Evaluation: A Critique". In Brunner, K.; Meltzer, A., The Phillips Curve and Labor Markets. Carnegie-Rochester Conference Series on Public Policy, 1. New York: American Elsevier, pp. 19-46.
It is to be noticed that the Pearl and Rubin approach explicitly requires, in a very strict way, the kind of "structural stability" with respect to intervention briefly discussed above and, as a consequence, while maybe useful at a micro level, is probably not the best way to deal with macroeconometric policy analysis.

74. Sims, Christopher (January 1980). "Macroeconomics and Reality". Econometrica, 48 (1): 1-48.
75. Ta-Chung Liu, "Underidentification, Structural Estimation, and Forecasting", Econometrica, Vol. 28, No. 4 (Oct. 1960), pp. 855-865.
76. A good summary in: Judea Pearl, Madelyn Glymour and Nicholas Jewell, "Causal Inference in Statistics: A Primer", Wiley, 2016.
77. See, for a good summary: Donald B. Rubin and Guido W. Imbens (2015), "Causal Inference for Statistics, Social, and Biomedical Sciences", Cambridge University Press.

Contents

1 Returns
1.1 Return definitions
1.2 Price and return data
1.3 Some empirical "facts"
2 Logarithmic random walk
2.1 "Stocks for the long run" and time diversification
2.2 *Some further consideration about log and linear returns
3 Volatility estimation
3.1 Is it easier to estimate µ or σ²?
4 Non Gaussian returns
5 Four different ways for computing the VaR
5.1 Gaussian VaR
5.2 Non parametric VaR
5.3 Semi parametric VaR
5.4 Mixture of Gaussians
6 Matrix algebra
7 Matrix algebra and Statistics
7.1 Risk budgeting
7.2 A varcov matrix is at least psd
7.3 Note
8 The deFinetti, Markowitz and Roy model for asset allocation
9 Linear regression
9.1 Weak OLS hypotheses
9.2 The OLS estimate
9.3 The Gauss Markoff theorem
9.4 Fit and errors of fit
9.5 R²
9.6 More properties of Ŷ and ε̂
9.7 Strong OLS hypotheses and testing linear hypotheses in the linear model
9.8 "Forecasts"
9.9 A note on P-values
9.10 Stochastic X
9.11 Markowitz and the linear model (this section is not required for the exam)
9.12 Some results useful for the interpretation of estimated coefficients
10 Style analysis
10.1 Traditional approaches with some connection to style analysis
10.2 Critiques to style analysis
11 Factor models and principal components
11.1 A very short introduction to linear asset pricing models
11.2 Estimates for B and F
11.3 Maximum variance factors
11.4 Bad covariance and good components?
12 Black and Litterman
13 Appendix: Probabilities as prices for betting
13.1 Betting systems
. . . . . . 202 13.2 Probability and frequency . . . . . . . . . . . . . . . . . . . . . . . . . 203 13.3 Probability and behaviour . . . . . . . . . . . . . . . . . . . . . . . . . 205 14 Appendix: Numbers and Maths in Economics 205 15 Appendix: Optimal Portfolio Theory, who invented it? 207 16 Appendix: Gauss Markoff theorem 208 17 Exercises and past exams 211 18 Appendix: Some matrix algebra 214 18.1 Definition of matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 18.2 Matrix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 18.3 Rank of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 18.4 Some special matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 18.5 Determinants and Inverse . . . . . . . . . . . . . . . . . . . . . . . . . 215 18.6 Quadratic forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 18.7 Random Vectors and Matrices (see the following appendix for more details)216 18.8 Functions of Random Vectors (or Matrices) . . . . . . . . . . . . . . . . 216 18.9 Expected Values of Random Vectors . . . . . . . . . . . . . . . . . . . 216 18.10Variance Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . 216 18.11Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 18.12Derivatives of linear functions and quadratic forms . . . . . . . . . . . 217 18.13Minimization of a PD quadratic form, approximate solution of over determined linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 312 18.14Minimization of a PD quadratic form under constraints. Simple applications to Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 18.15The linear model in matrix notation . . . . . . . . . . . . . . . . . . . . 221 19 Appendix: What you cannot ignore about Probability and Statistics 222 19.1 Probability: a Language . . . . . . . . . . . . . . . . . . . . . . . . . . 
226 19.2 Interpretations of Probability . . . . . . . . . . . . . . . . . . . . . . . 227 19.3 Probability and Randomness . . . . . . . . . . . . . . . . . . . . . . . . 227 19.4 Different Fields: Physics . . . . . . . . . . . . . . . . . . . . . . . . . . 228 19.5 Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 19.6 Other fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 19.7 Wrong Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 19.8 Meaning of Correct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 19.9 Events and Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 19.10Classes of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 19.11Probability as a Set Function . . . . . . . . . . . . . . . . . . . . . . . 231 19.12Basic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 19.13Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 19.14Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 19.15Stochastic Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 232 19.16Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 19.17Properties of the PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 19.18Density and Probability Function . . . . . . . . . . . . . . . . . . . . . 233 19.19Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 19.20Probability Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 19.21Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 19.22Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 19.23Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 19.24Tchebicev Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
235 19.25*Vysochanskij–Petunin Inequality . . . . . . . . . . . . . . . . . . . . . 235 19.26*Gauss Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 19.27*Cantelli One Sided Inequality . . . . . . . . . . . . . . . . . . . . . . . 236 19.28Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 19.29Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 19.30Subsection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 19.31Univariate Distributions Models . . . . . . . . . . . . . . . . . . . . . . 237 19.32Some Univariate Discrete Distributions . . . . . . . . . . . . . . . . . . 237 19.33Some Univariate Continuous Distributions . . . . . . . . . . . . . . . . 238 19.34Some Univariate Continuous Distributions . . . . . . . . . . . . . . . . 238 19.35Some Univariate Continuous Distributions . . . . . . . . . . . . . . . . 238 313 19.36Random Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.37Distribution Function for a Random Vector . . . . . . . . . . . . . . . . 19.38Density and Probability Function . . . . . . . . . . . . . . . . . . . . . 19.39Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.40Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.41Stochastic Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 19.42Mutual Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.43Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 19.44Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 19.45Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 19.46Law of Iterated Expectations . . . . . . . . . . . . . . . . . . . . . . . 19.47Regressive Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.48Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . 
19.49Distribution of the max and the min for independent random variables 19.50Distribution of the max and the min for independent random variables 19.51Distribution of the sum of independent random variables and central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.52Distribution of the sum of independent random variables and central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.53Distribution of the sum of independent random variables and central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.54Why Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.55Unknown Probabilities and Symmetry . . . . . . . . . . . . . . . . . . 19.56Unknown Probabilities and Symmetry . . . . . . . . . . . . . . . . . . 19.57No Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.58No Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.59Learning Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.60Pyramidal Die . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.61Pyramidal Die Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.62Pyramidal Die Constraints . . . . . . . . . . . . . . . . . . . . . . . . . 19.63Many Rolls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.64Probability of Observing a Sample . . . . . . . . . . . . . . . . . . . . 19.65Pre or Post Observation? . . . . . . . . . . . . . . . . . . . . . . . . . . 19.66Maximize the Probability of the Observed Sample . . . . . . . . . . . . 19.67Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.68Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.69Possibly Different Samples . . . . . . . . . . . . . . . . . . . . . . . . . 19.70The Probability of Our Sample . . . . . . . . . . . . . . . . . . . . . . 
19.71The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . . . 19.72The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . . . 19.73The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . . . 314 238 239 239 239 239 240 240 241 241 241 241 242 242 243 243 244 244 244 245 245 246 246 247 247 248 248 249 249 249 250 250 250 251 251 251 252 252 252 19.74The Estimate in Other Possible Samples . . . . . . . . 19.75The Estimate in Other Possible Samples . . . . . . . . 19.76Sampling Variability . . . . . . . . . . . . . . . . . . . 19.77Sampling Variability . . . . . . . . . . . . . . . . . . . 19.78Sampling Variability . . . . . . . . . . . . . . . . . . . 19.79Sampling Variability . . . . . . . . . . . . . . . . . . . 19.80Estimated Sampling Variability . . . . . . . . . . . . . 19.81Quantifying Sampling Variability . . . . . . . . . . . . 19.82Principle 1 . . . . . . . . . . . . . . . . . . . . . . . . . 19.83Principle 2 . . . . . . . . . . . . . . . . . . . . . . . . . 19.84Principle 3 . . . . . . . . . . . . . . . . . . . . . . . . . 19.85The Questions of Statistics . . . . . . . . . . . . . . . . 19.86Statistical Model . . . . . . . . . . . . . . . . . . . . . 19.87Specification of a Parametric Model . . . . . . . . . . . 19.88Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . 19.89Parametric Inference . . . . . . . . . . . . . . . . . . . 19.90Different Inferential Tools . . . . . . . . . . . . . . . . 19.91Point Estimation . . . . . . . . . . . . . . . . . . . . . 19.92Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . 19.93Mean Square Error . . . . . . . . . . . . . . . . . . . . 19.94Mean Square Efficiency . . . . . . . . . . . . . . . . . . 19.95Meaning of Efficiency . . . . . . . . . . . . . . . . . . . 19.96Mean Square Consistency . . . . . . . . . . . . . . . . 19.97Mean Square Consistency . . . . . . . . . . . . . . . . 19.98Methods for Building Estimates . . . . . . . . . . . . . 
19.99Method of Moments . . . . . . . . . . . . . . . . . . . 19.100Estimation of Moments . . . . . . . . . . . . . . . . . . 19.101Inverting the Moment Equation . . . . . . . . . . . . . 19.102Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 19.103Maximum Likelihood . . . . . . . . . . . . . . . . . . . 19.104Maximum Likelihood . . . . . . . . . . . . . . . . . . . 19.105Interpretation . . . . . . . . . . . . . . . . . . . . . . . 19.106Interpretation . . . . . . . . . . . . . . . . . . . . . . . 19.107Interpretation . . . . . . . . . . . . . . . . . . . . . . . 19.108Interpretation . . . . . . . . . . . . . . . . . . . . . . . 19.109Maximum Likelihood for Densities . . . . . . . . . . . . 19.110Example (Discrete Case) . . . . . . . . . . . . . . . . . 19.111Example Method of Moments . . . . . . . . . . . . . . 19.112Example Maximum likelihood . . . . . . . . . . . . . . 19.113More Advanced Topics . . . . . . . . . . . . . . . . . . 19.114Sampling Standard Deviation and Confidence Intervals 315 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 253 253 254 254 254 255 255 255 256 256 256 257 257 257 257 258 258 258 258 259 259 259 260 260 260 260 261 261 261 262 262 262 263 263 263 263 263 264 264 264 19.115Sampling Variance of the Mean . . . . 
19.116Estimation of the Sampling Variance . 19.117nσ Rules . . . . . . . . . . . . . . . . . 19.118Confidence Intervals . . . . . . . . . . 19.119Confidence Intervals . . . . . . . . . . 19.120Confidence Intervals . . . . . . . . . . 19.121Hypothesis testing . . . . . . . . . . . 19.122Parametric Hypothesis . . . . . . . . . 19.123Two Hypotheses . . . . . . . . . . . . . 19.124Simple and Composite . . . . . . . . . 19.125Example . . . . . . . . . . . . . . . . . 19.126Critical Region, Acceptance Region . . 19.127Errors of First and Second Kind . . . . 19.128Power Function and Size of the Errors 19.129Testing Strategy . . . . . . . . . . . . . 19.130Asymmetry . . . . . . . . . . . . . . . 19.131Some Tests . . . . . . . . . . . . . . . 19.132Some Tests . . . . . . . . . . . . . . . 19.133Some Tests . . . . . . . . . . . . . . . 19.134Some Tests . . . . . . . . . . . . . . . 19.135Some Tests . . . . . . . . . . . . . . . 19.136Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 265 265 266 266 266 266 267 267 267 268 268 268 269 269 269 269 270 270 270 270 271 20 *Taylor formula in finance (not for the exam) 20.1 *Taylor’s theorem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.2 *Remainder term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.3 *Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.4 *Taylor formula and Taylor series . . . . . . . . . . . . . . . . . . . . . 20.5 *Taylor formula for functions of several variables . . . . . . . . . . . . . 20.6 *Simple examples of Taylor formula and Taylor theorem in quantitative Economics and Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.7 *Linear and log returns, a reconsideration . . . . . . . . . . . . . . . . 20.8 *Taylor theorem and the connection between linear and log returns . . 20.9 *How big is the error? . . . . . . . . . . . . . . . . . . . . . . . . . . . 
20.10*Gordon model and Campbell-Shiller approximation. . . . . . . . . . . 20.11*Remainder term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.12*Dividend price model . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.13*What happens if we take the remainder seriously . . . . . . . . . . . . 20.14*Cochrane toy model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 272 272 273 274 274 21 *Appendix: Some further info about the use of regression models 306 316 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 276 279 279 280 285 285 287 289