Large Data Bases: Advantages, Problems and Puzzles
Some naive observations from an economist
Alan Kirman, GREQAM Marseille
Jerusalem, September 2008

Some Basic Points
• Economic data bases may be large in two ways.
• Firstly, they may simply contain a very large number of observations, the best examples being tick-by-tick data.
• Secondly, as with some panel data, each observation may have many dimensions.

The Advantages and Problems
• From a statistical point of view, at least, the high frequency data might seem to be unambiguously advantageous. However, the very nature of the data has to be examined carefully, and certain stylised facts emerge which are not present at lower frequency.
• In the case of multidimensional data, the « curse of dimensionality » may arise.

FX: A classic example of high frequency data
• Usually Reuters indicative quotes are used for the analysis. What do they consist of?
• Banks enter bids and asks for a particular currency pair, such as the euro-dollar, and put a time stamp on them to indicate the exact time of posting.
• These quotes are « indicative » and the banks are not legally obliged to honour them.
• For euro-dollar there are between 10 and 20 thousand updates per day.

Brief Reminder of the Characteristics of this sort of data
• Returns are given by $r_t = S_t - S_{t-1}$, where $S_t := \ln P_t$.
• We know that there is no autocorrelation between successive returns, but that $r_t^2$ and $|r_t|$ are positively autocorrelated, except at very small time intervals, and that this autocorrelation decays slowly.
• Volatility exhibits spikes, referred to as volatility clustering.

A Problem
• The idea of using such data, as Brousseau (2007) points out, is to track the « true value » of the exchange rate through a period.
• But the data are not all of the same « quality ».
• Although the quotes hold, at least briefly, between major banks, they may not do so for other customers, and they may also depend on the amounts involved.
• There may be mistakes, quotes placed as « advertising », and quotes with spreads so large that they encompass the spread between the best bid and ask and thus convey no information.

Cleaning the Data
• Brousseau and other authors propose various filtering methods, from simple to sophisticated. For example, if the jump between two successive mid-points exceeds a certain threshold, the observation is eliminated (a primitive first run).
• However, how can one judge whether the filtering is successful?
• One idea is to test against quotes which are binding, such as those on EBS. But this is not a guarantee.

Some stylised facts

Microstructure Noise
• In principle, the higher the sampling frequency, the more precise the estimates of integrated volatility become.
• However, the presence of so-called market microstructure features at very high sampling frequencies may create important complications.
• Financial transactions - and hence price changes and non-zero returns - arrive discretely rather than continuously over time.
• There is negative serial correlation of returns to successive transactions (including the so-called bid-ask bounce), and trades have a price impact.
• For a discussion see Hasbrouck (2006), O'Hara (1998), and Campbell et al. (1997, Ch. 3).

Microstructure Noise
• Why should we treat this as « noise » rather than integrate it into our models?
• One argument is that it overemphasises volatility: sampling too frequently gives a spuriously high value (see the sketches below).
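A minimal sketch of the « primitive first run » filter and of the stylised facts mentioned above, assuming the quotes sit in a pandas DataFrame with 'bid' and 'ask' columns indexed by timestamp; the column names, the relative jump threshold and the use of pandas are illustrative assumptions, not part of the original slides.

```python
import numpy as np
import pandas as pd

def clean_quotes(quotes: pd.DataFrame, max_jump: float = 0.001) -> pd.DataFrame:
    """Drop quotes whose mid-point jumps more than `max_jump` (relative)
    from the previously retained mid-point: the primitive first-run filter."""
    mid = (quotes["bid"] + quotes["ask"]) / 2.0
    keep, last = [], None
    for t, m in mid.items():
        if last is None or abs(m / last - 1.0) <= max_jump:
            keep.append(t)
            last = m
    return quotes.loc[keep]

def log_returns(mid: pd.Series) -> pd.Series:
    """r_t = ln P_t - ln P_{t-1}."""
    return np.log(mid).diff().dropna()

def autocorr(x: pd.Series, lags: int = 20) -> list:
    """Sample autocorrelations at lags 1..lags."""
    return [x.autocorr(lag) for lag in range(1, lags + 1)]

# Typical usage (file name hypothetical): raw returns show little autocorrelation,
# while |r_t| and r_t**2 are positively autocorrelated with slow decay.
# quotes = pd.read_csv("eurusd_ticks.csv", index_col=0, parse_dates=True)
# mid = clean_quotes(quotes)[["bid", "ask"]].mean(axis=1)
# r = log_returns(mid)
# print(autocorr(r), autocorr(r.abs()), autocorr(r ** 2))
```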
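To see why sampling « too frequently » can give a spuriously high value, one can compute the standard realized-volatility estimator at several sampling intervals. The sketch below uses a simulated random-walk « efficient » price plus additive i.i.d. noise; the noise model and all parameter values are assumptions chosen only to illustrate the bias.

```python
import numpy as np

def realized_variance(log_prices: np.ndarray, step: int) -> float:
    """Sum of squared returns when the series is sampled every `step` ticks."""
    returns = np.diff(log_prices[::step])
    return float(np.sum(returns ** 2))

rng = np.random.default_rng(0)
n, sigma, noise_sd = 20_000, 1e-4, 5e-5
efficient = np.cumsum(rng.normal(0.0, sigma, n))      # efficient log price
observed = efficient + rng.normal(0.0, noise_sd, n)   # observed = efficient + noise

print("true integrated variance:", n * sigma ** 2)
for step in (1, 5, 20, 100):
    # At the highest frequencies the noise inflates the estimate;
    # sparser sampling brings it back towards the true value.
    print(f"sampling every {step:>3} ticks:", realized_variance(observed, step))
```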
• On the other hand, Hansen and Lunde (2006) assert that, empirically, market microstructure noise is negatively correlated with the returns, and hence biases the estimated volatility downward. However, this empirical stylized fact, based on their analysis of high-frequency stock returns, does not seem to carry over to the FX market.

Microstructure Noise
• « For example, if an organized stock exchange has designated market makers and specialists, and if these participants are slow in adjusting prices in response to shocks (possibly because the exchange's rules explicitly prohibit them from adjusting prices by larger amounts all at once), it may be the case that realized volatility could drop if it is computed at those sampling frequencies for which this behavior is thought to be relevant.
• In any case, it is widely recognized that market microstructure issues can contaminate estimates of integrated volatility in important ways, especially if the data are sampled at ultra-high frequencies, as is becoming more and more common. » Chaboud et al. (2007)

What do we claim to explain?
• Let's look rapidly at a standard model and see how we determine the prices.
• What we claim for this model is that it is the switching between chartist and fundamentalist behaviour that leads to
1. Fat tails
2. Long memory
3. Volatility clustering
• What does high frequency data have to do with this?

Specifying Individual Behavior
• There is a finite set $A$ of agents trading a single risky asset.
• The demand function of agent $a \in A$ takes the log-linear form
$e_t^a(p) := c_t^a (\hat S_t^a - \log p) + \eta_t^a$,
where $\hat S_t^a$ and $\eta_t^a$ denote the agent's current reference level and liquidity demand, respectively.
• The logarithmic equilibrium price $S_t := \log P_t$ is defined through the market clearing condition of zero total excess demand:
$S_t := \frac{1}{c_t} \sum_{a \in A} \left( c_t^a \hat S_t^a + \eta_t^a \right)$, with $c_t := \sum_{a \in A} c_t^a$.
Temporary equilibrium prices are given as a weighted average of individual price assessments and liquidity demand.

Choosing Individual Assessments
• The choice of the reference level is based on the recommendations of some financial experts:
$\hat S_t^a \in \{ R_t^1, \ldots, R_t^m \}$.
• The fraction of agents following guru $i$ in period $t$ is given by
$\pi_t^i := \frac{1}{c_t} \sum_{a \in A} c_t^a \, 1_{\{\hat S_t^a = R_t^i\}}$.
• The logarithmic equilibrium price for period $t$ then takes the form
$S_t = \sum_{i=1}^m \pi_t^i R_t^i + \eta_t$.
Temporary equilibrium prices are given as a weighted average of recommendations and liquidity demand.

The Gurus' Recommendations
• The recommendation of guru $i \in \{1, \ldots, m\}$ is based on a subjective assessment $F^i$ of some fundamental value and a price trend:
$R_t^i := S_{t-1} + \alpha^i (F^i - S_{t-1}) + \beta^i (S_{t-1} - S_{t-2})$.
• The dynamics of stock prices is governed by the recursive relation
$S_t = F(S_{t-1}, S_{t-2}; \xi_t) = (1 - \alpha_t + \beta_t)\, S_{t-1} - \beta_t\, S_{t-2} + \gamma_t$
in the random environment $\xi_t = (\pi_t, \eta_t)$, where $\alpha_t$, $\beta_t$ and $\gamma_t$ aggregate the gurus' parameters $\alpha^i$, $\beta^i$, $F^i$ and the liquidity demand according to the fractions $\pi_t^i$.
• Unlike in physics, the environment will be generated endogenously.
The dynamics of stock prices is described by a linear recursive equation in a random environment of investor sentiment and liquidity demand.

Fundamentalists
• The recommendation of a fundamentalist conveys the idea that prices move closer to the fundamental value:
$R_t^i := S_{t-1} + \alpha^i (F^i - S_{t-1})$, with $\alpha^i \in (0, 1)$.
• If only fundamentalists are active on the market,
$S_t = (1 - \alpha_t)\, S_{t-1} + \gamma_t$, with $\alpha_t = \sum_{i=1}^m \pi_t^i \alpha^i$,
and prices behave in a mean-reverting manner because $\alpha^i \in (0, 1)$.
• The sequence of temporary price equilibria may be viewed as an Ornstein-Uhlenbeck process in a random environment.
Fundamentalists have a stabilizing effect on the dynamics of stock prices.
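A small simulation of the fundamentalist-only case described above: a discrete-time Ornstein-Uhlenbeck (AR(1)) recursion that mean-reverts towards the fundamental value. A single fundamental value, i.i.d. Gaussian liquidity demand and all parameter values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def simulate_fundamentalists(alpha: float, F: float, sigma: float,
                             s0: float, n_steps: int, seed: int = 0) -> np.ndarray:
    """Log-price dynamics S_t = S_{t-1} + alpha * (F - S_{t-1}) + eta_t:
    mean reversion towards F when alpha lies in (0, 1)."""
    rng = np.random.default_rng(seed)
    S = np.empty(n_steps)
    S[0] = s0
    for t in range(1, n_steps):
        S[t] = S[t - 1] + alpha * (F - S[t - 1]) + rng.normal(0.0, sigma)
    return S

# Starting far above the fundamental value, prices drift back towards it.
path = simulate_fundamentalists(alpha=0.2, F=0.0, sigma=0.01, s0=1.0, n_steps=500)
print(path[:5], path[-5:])
```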
Chartists
• A chartist bases his prediction of the future evolution of stock prices on past observations:
$R_t^i := S_{t-1} + \beta^i (S_{t-1} - S_{t-2})$, with $\beta^i \in (0, 1)$.
• If only chartists are active in the market,
$S_t = S_{t-1} + \beta_t (S_{t-1} - S_{t-2}) + \gamma_t$, with $\beta_t = \sum_{i=1}^m \pi_t^i \beta^i$.
• Returns behave in a mean-reverting manner, but prices are highly transient.
Chartists have a destabilizing effect on the dynamics of stock prices.

The Interactive Effects of Chartists and Fundamentalists
• If both chartists and fundamentalists are active,
$S_t = (1 - \alpha_t + \beta_t)\, S_{t-1} - \beta_t\, S_{t-2} + \gamma_t$.
• Prices behave in a stable manner in periods where the impact of chartists is weak enough.
• Prices behave in an unstable manner in periods where the impact of chartists becomes too strong.
• Temporary bubbles and crashes occur, due to trend chasing.
The overall behavior of the price process turns out to be ergodic if, on average, the impact of chartists is not too strong.

Performance Measures
How do the agents decide which guru to follow?
• The agents' propensity to follow an individual guru depends on the guru's performance.
• We associate « virtual » profits with the gurus' trading strategies:
$P_t^i := (R_{t-1}^i - S_{t-1})\,(e^{S_t} - e^{S_{t-1}})$.
• The performance of guru $i$ in period $t$ is given by
$U_t^i := \lambda\, U_{t-1}^i + P_t^i = \sum_{j=0}^{t} \lambda^{\,t-j} P_j^i$,
i.e., by a discounted sum of past profits (for some discount factor $\lambda$).
The agents adopt the gurus' recommendations with probabilities related to their current performance.

Performance Measures
• Propensities to follow individual gurus depend on performances:
$\pi_{t+1} \sim Q(U_t; \cdot)$, where $U_t = (U_t^1, \ldots, U_t^m)$.
• The better a guru's performance, the more likely the agents are to follow his recommendations.
• The more agents follow a guru's recommendation, the stronger his impact on the dynamics of stock prices.
• The stronger a guru's impact on the dynamics of stock prices, the better his performance.
The dependence of individual choices on performances generates a self-reinforcing incentive to follow the currently most successful guru.

Performance Measures and Feedback Effects
• The dynamics of logarithmic stock prices are described by a linear stochastic difference equation
$S_t = (1 - \alpha_t + \beta_t)\, S_{t-1} - \beta_t\, S_{t-2} + \gamma_t$
in a random environment $\{(\pi_t, \eta_t)\}$.
• Aggregate liquidity demand is modelled by an exogenous process.
• The dynamics of $\{\pi_t\}$ is generated in an endogenous manner.
• The distribution of $\pi_t$ depends on all the prices up to time $t-1$.
The dependence of individual choices on performances generates a feedback from past prices into the random environment.

The Associated Markov Chain
• Aggregate liquidity demand follows an i.i.d. dynamics.
• Stock prices are given by the first component of the Markov chain
$\xi_t = (S_t, S_{t-1}, U_t)$.
• The dynamics of the process $\xi_t$ can be described by
$\xi_{t+1} = V(\xi_t, \zeta_t) := \big( F(S_t, S_{t-1}, \zeta_t),\; S_t,\; U_t + P(S_t, S_{t-1}, \zeta_t) \big)$, where $\zeta_t \sim Z(U_t; \cdot)$.
• The map $(S_t, S_{t-1}) \mapsto P(S_t, S_{t-1}, \zeta_t)$ is non-linear.
The dynamics of the price-performance process $\xi_t$ can be described by an iterated function system, but standard methods do not apply.

Stopping the process from exploding
• Bound the probability that an individual can become a chartist.
• If we do not do this, the process may simply explode.
• We do not, however, put arbitrary limits on the prices that can be attained. (A schematic simulation of the whole mechanism is sketched below.)
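A compressed simulation sketch of the price-performance system above, with one fundamentalist and one chartist guru, performance-based switching and a bound on the chartist weight to keep the process from exploding. This is only a schematic reading of the slides, not the authors' code: the two-guru setup, the softmax (logit) choice rule, the cap as the bounding device and every numerical value are assumptions made for illustration.

```python
import numpy as np

def simulate_guru_market(n_steps: int = 5_000, seed: int = 0) -> np.ndarray:
    """Schematic chartist/fundamentalist market: guru 0 is a fundamentalist
    (alpha > 0, beta = 0), guru 1 a chartist (alpha = 0, beta > 0)."""
    rng = np.random.default_rng(seed)
    alphas = np.array([0.2, 0.0])      # mean-reversion strength of each guru
    betas = np.array([0.0, 0.9])       # trend-chasing strength of each guru
    F = 0.0                            # fundamental (log) value used by guru 0
    discount = 0.95                    # discounting of past virtual profits
    choice_intensity = 50.0            # sharpness of the performance-based choice
    max_chartist_weight = 0.8          # bound that keeps the process from exploding
    sigma_liquidity = 0.01             # std of aggregate liquidity demand

    S = np.zeros(n_steps)              # log prices
    U = np.zeros(2)                    # guru performances
    R_prev = np.zeros(2)               # last period's recommendations

    for t in range(2, n_steps):
        # Recommendations based on the fundamental value and the price trend.
        R = S[t - 1] + alphas * (F - S[t - 1]) + betas * (S[t - 1] - S[t - 2])

        # Fractions following each guru: softmax of performances, chartist capped.
        w = np.exp(choice_intensity * (U - U.max()))
        pi = w / w.sum()
        if pi[1] > max_chartist_weight:
            pi = np.array([1.0 - max_chartist_weight, max_chartist_weight])

        # Temporary equilibrium price: weighted recommendations plus liquidity demand.
        S[t] = pi @ R + rng.normal(0.0, sigma_liquidity)

        # Virtual profits of last period's recommendations, discounted performance.
        profits = (R_prev - S[t - 1]) * (np.exp(S[t]) - np.exp(S[t - 1]))
        U = discount * U + profits
        R_prev = R

    return S

# Inspect one diagnostic of the simulated returns (fat tails are one of the claims).
S = simulate_guru_market()
r = np.diff(S)
excess_kurtosis = ((r - r.mean()) ** 4).mean() / r.var() ** 2 - 3.0
print("excess kurtosis of simulated returns:", round(float(excess_kurtosis), 2))
```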
Nice Story! But…

Specifying Individual Behavior
• There is a finite set $A$ of agents trading a single risky asset.
• The demand function of agent $a \in A$ takes the log-linear form
$e_t^a(p) := c_t^a (\hat S_t^a - \log p) + \eta_t^a$,
where $\hat S_t^a$ and $\eta_t^a$ denote the agent's current reference level and liquidity demand, respectively.
• The logarithmic equilibrium price $S_t := \log P_t$ is defined through the market clearing condition of zero total excess demand:
$S_t := \frac{1}{c_t} \sum_{a \in A} \left( c_t^a \hat S_t^a + \eta_t^a \right)$.
Temporary equilibrium prices are given as a weighted average of individual price assessments and liquidity demand.

The Real Problem
• We have a market clearing equilibrium, but this is not the way these markets function.
• They function on the basis of an order book, and that is what we should model.
• Each price in very high frequency data corresponds to an individual transaction.
• The mechanics of the order book will influence the structure of the time series.
• How often do our agents revise their prices?
• They infer information from the actions of others, revealed by the transactions.

How to solve this?
• This is the subject of a project with Ulrich Horst.
• We will model an arrival process for orders, and the distribution from which these orders are drawn will be determined by the movements of prices.
• In this way we model directly what is too often referred to as « microstructure noise » and remove one of the problems with using high frequency data.

A Challenge
« In deep and liquid markets, market microstructure noise should pose less of a concern for volatility estimation. It should be possible to sample returns on such assets more frequently than returns on individual stocks before estimates of integrated volatility encounter significant bias caused by the market microstructure features. It is possible to sample the FX data as often as once every 15 to 20 seconds without the standard estimator of integrated volatility showing discernible effects stemming from market microstructure noise. This interval is shorter than the sampling intervals of several minutes, usually five or more minutes, often recommended in the empirical literature. This shorter sampling interval and associated larger sample size affords a considerable gain in estimation precision. In very deep and liquid markets, microstructure-induced frictions may be much less of an issue for volatility estimation than was previously thought. » Chaboud et al. (2007)
Our job is to explain why this is so!

The Curse of Dimensionality
The colorful phrase the 'curse of dimensionality' was apparently coined by Richard Bellman in [3], in connection with the difficulty of optimization by exhaustive enumeration on product spaces. Bellman reminded us that, if we consider a cartesian grid of spacing 1/10 on the unit cube in 10 dimensions, we have $10^{10}$ points; if the cube in 20 dimensions was considered, we would have of course $10^{20}$ points. His interpretation: if our goal is to optimize a function over a continuous product domain of a few dozen variables by exhaustively searching a discrete search space defined by a crude discretization, we could easily be faced with the problem of making tens of trillions of evaluations of the function. Bellman argued that this curse precluded, under almost any computational scheme then foreseeable, the use of exhaustive enumeration strategies, and argued in favor of his method of dynamic programming.
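Bellman's back-of-the-envelope arithmetic, made explicit; the evaluation speed of one million function calls per second is an illustrative assumption.

```python
# Exhaustive search on a grid of spacing 1/10 over the unit cube:
# 10 points per axis, hence 10**d grid points in d dimensions.
for d in (10, 20, 30):
    points = 10 ** d
    # Suppose (optimistically) one function evaluation per microsecond.
    years = points * 1e-6 / (3600 * 24 * 365)
    print(f"d = {d:2d}: {points:.0e} grid points, ~{years:.1e} years at 1e6 evals/sec")
```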
Why does this matter?
• We collect more and more data on individuals and, in particular, on consumers and the unemployed.
• If we have $D$ observations on $N$ individuals, the relationship between $D$ and $N$ is important if we wish to estimate some functional relation between the variables.
• There is now a whole battery of approaches for reducing the dimensionality of the problem, and these represent a major challenge for econometrics.

A blessing?
• Mathematicians assert that such high dimensionality leads to a « concentration of measure ».
• Someone here can no doubt explain how this might help economists!
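One concrete face of the « concentration of measure » phenomenon: pairwise distances between points drawn uniformly in the unit cube concentrate around their mean as the dimension grows, so relative fluctuations shrink. A small illustration; the sample size and the dimensions tried are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 100

for d in (2, 10, 100, 1000):
    X = rng.uniform(0.0, 1.0, size=(n_points, d))
    # All pairwise Euclidean distances between the sampled points.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(n_points, k=1)]
    # As d grows, std/mean shrinks: distances "concentrate" around their mean.
    print(f"d = {d:4d}: mean pairwise distance {dists.mean():.3f}, "
          f"std/mean {dists.std() / dists.mean():.3f}")
```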