The advantages and disadvantages of large market data bases

Large Data Bases: Advantages,
Problems and Puzzles: Some
naive observations from an
Alan Kirman,
GREQAM Marseille
Jerusalem September 2008
Some Basic Points
• Economic data bases may be large in two ways.
• Firstly they may simply contain a very large
number of observations. The best examples
being tick by tick data.
• Secondly as with some panel data each
observation may have many dimensions.
The Advantages and Problems
• From a statistical point of view at least the
high frequency data might seem to be
unambiguously advantageous. However the
very nature of the data has to be examined
carefully and certain stylised facts emerge
which are not present at lower frequency
• In the case of multidimensional data, the
« curse of dimensionality » may arise.
FX: A classic example of high
frequency data
• Usually Reuters indicative quotes are used for the
analysis. What do they consist of?
• Banks enter bids and asks for a particular currency
pair, such as the euro-dollar. They put a time
stamp to indicate the exact time of posting
• These quotes are « indicative » and the banks are
not legally obliged to honour them.
• For euro-dollar there are between 10 and 20
thousand updates per day.
Brief Reminder of the
Characteristics of this sort of data
• Returns are given by
$$ r_t = \Delta S_t = \ln\left(\frac{P_t}{P_{t-1}}\right), \qquad S_t = \ln P_t $$
• We know that there is no autocorrelation between
successive returns, but that $r_t^2$ and $|r_t|$ are
positively autocorrelated, except at very small time
intervals, and that this autocorrelation decays slowly
• Volatility exhibits spikes, referred to as volatility
clustering
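A minimal sketch of how these stylised facts can be checked. It uses a toy series rather than real FX quotes: the simulated returns switch between two volatility levels in blocks of 200 ticks, an assumption made purely for illustration, and the helper names `log_returns` and `autocorr` are hypothetical.

```python
import numpy as np

def log_returns(prices):
    """r_t = ln(P_t / P_{t-1}) for a 1-D array of positive prices."""
    p = np.asarray(prices, dtype=float)
    return np.diff(np.log(p))

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Toy data (assumption): the return scale is constant within 200-tick blocks
# and is drawn at random from two levels, which produces volatility clustering.
rng = np.random.default_rng(0)
n = 20_000
regime = np.repeat(rng.integers(0, 2, size=n // 200), 200)
sigma = np.where(regime == 1, 3e-4, 1e-4)
r = sigma * rng.standard_normal(n)
prices = 1.40 * np.exp(np.concatenate(([0.0], np.cumsum(r))))

r_hat = log_returns(prices)
for name, series in [("r", r_hat), ("|r|", np.abs(r_hat)), ("r^2", r_hat ** 2)]:
    print(name, [round(autocorr(series, k), 3) for k in (1, 10, 50)])
```

On such a series one would expect the autocorrelations of the raw returns to sit near zero while those of the absolute and squared returns stay clearly positive. The block structure does not reproduce the slow decay seen in real data; it only shows the contrast between the three series.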
A Problem
• The idea of using such data, as Brousseau (2007)
points out, is to track the « true value » of the
exchange rate through a period.
• But all the data are not of the same « quality »
• Although the quotes hold, at least briefly, between
major banks, they may not do so for other
customers and they may also depend on the
amounts involved.
• There may be mistakes, quotes placed as
« advertising », and quotes with spreads so wide that
they encompass the spread between the best bid
and ask and thus convey no information
Cleaning the Data
• Brousseau and other authors propose various
filtering methods, from simple to sophisticated. If
the jump between two successive mid-points
exceeds a certain threshold, for example, the
observation is eliminated (a primitive first run).
• However, how can one judge whether the filtering
is successful?
• One idea is to test against quotes which are
binding such as those on EBS. But this is not a
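The primitive threshold filter described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Brousseau or any other author: the function name `filter_jumps`, the `(timestamp, bid, ask)` tuple layout, the use of log mid-points, and the choice to compare each quote with the last accepted one are all hypothetical.

```python
import math

def filter_jumps(quotes, threshold):
    """Drop quotes whose mid-point jumps by more than `threshold`
    (in absolute log terms) relative to the last accepted mid-point.

    quotes: iterable of (timestamp, bid, ask) tuples in time order.
    Returns the list of accepted quotes.
    """
    kept = []
    last_mid = None
    for ts, bid, ask in quotes:
        mid = 0.5 * (bid + ask)
        if last_mid is not None and abs(math.log(mid / last_mid)) > threshold:
            continue  # jump too large: treat the quote as an outlier
        kept.append((ts, bid, ask))
        last_mid = mid
    return kept

quotes = [
    (0, 1.4000, 1.4002),
    (1, 1.4001, 1.4003),
    (2, 1.4500, 1.4502),  # isolated jump, e.g. a typing error
    (3, 1.4002, 1.4004),
]
clean = filter_jumps(quotes, threshold=0.001)
print([ts for ts, _, _ in clean])
```

A filter of this kind trusts the first quote unconditionally and would reject a genuine large move, which is one reason the question of how to judge its success matters.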
Some stylised facts
Microstructure Noise
• In principle, the higher the sampling frequency is, the more
precise the estimates of integrated volatility become
• However, the presence of so-called market microstructure
features at very high sampling frequencies may create
important complications.
• Financial transactions, and hence price changes and nonzero returns, arrive discretely rather than continuously over time
• There is negative serial correlation of returns to
successive transactions (including the so-called bid-ask
bounce), and trades have a price impact.
• For a discussion see Hasbrouck (2006), O’Hara (1998),
and Campbell et al. (1997, Ch. 3)
Microstructure Noise
• Why should we treat this as « noise » rather than integrate
it into our models?
• One argument is that it overemphasises volatility. In other
words, sampling too frequently gives a spuriously high
estimate of volatility
• On the other hand, Hansen and Lunde (2006) assert that
empirically market microstructure noise is negatively
correlated with the returns, and hence biases the estimated
volatility downward. However, this empirical stylized fact,
based on their analysis of high-frequency stock returns,
does not seem to carry over to the FX market
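The first argument on this slide, that sampling too frequently inflates measured volatility, can be illustrated with a small simulation. It rests on assumptions that are not in the slides: the efficient log price is a random walk, the noise is i.i.d. and additive, and all parameter values are made up. Noise that is correlated with returns, as Hansen and Lunde describe, would behave differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 86_400          # one observation per second over a day (assumed)
sigma_day = 0.006   # daily volatility of the efficient log price (assumed)
noise_sd = 5e-5     # standard deviation of i.i.d. additive noise (assumed)

# Efficient log price: a random walk whose increments sum to variance sigma_day**2
efficient = np.cumsum(sigma_day / np.sqrt(n) * rng.standard_normal(n))
# Observed log price: efficient price contaminated by microstructure noise
observed = efficient + noise_sd * rng.standard_normal(n)

def realized_variance(log_price, step):
    """Sum of squared log returns when sampling every `step` observations."""
    r = np.diff(log_price[::step])
    return float(np.sum(r ** 2))

true_iv = sigma_day ** 2
for step in (1, 5, 15, 60, 300):
    ratio = realized_variance(observed, step) / true_iv
    print(f"sampling every {step:>3} s: RV / true variance = {ratio:.2f}")
```

Under these assumptions each squared return picks up roughly twice the noise variance, so the bias grows with the number of returns summed, and the ratio should fall towards one as the sampling interval lengthens.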
Microstructure Noise
• « For example, if an organized stock exchange has
designated market makers and specialists, and if these
participants are slow in adjusting prices in response to
shocks (possibly because the exchange's rules explicitly
prohibit them from adjusting prices by larger amounts all
at once), it may be the case that realized volatility could
drop if it is computed at those sampling frequencies for
which this behavior is thought to be relevant.
• In any case, it is widely recognized that market
microstructure issues can contaminate estimates of
integrated volatility in important ways, especially if the
data are sampled at ultra-high frequencies, as is becoming
more and more common. »
Chaboud et al. (2007)
What do we claim to explain?
Let’s look rapidly at a standard model and see
how we determine the prices.
• What we claim for this model is that it is the
switching from chartist to fundamentalist
behaviour that leads to
1. Fat tails
2. Long memory
3. Volatility clustering
What does high frequency data have to do with this?
Specifying Individual Behavior
• There is a finite set $A$ of agents trading a single risky
asset.
• The demand function of the agent $a \in A$ takes the log-linear form:
$$ e_t^a(p, \omega) := c_t^a \left( \hat{S}_t^a(\omega) - \log p \right) + \eta_t^a(\omega) $$
where $\hat{S}_t^a$ and $\eta_t^a$ denote the agent's current reference
level and liquidity demand, respectively.
• The logarithmic equilibrium price $S_t := \log P_t$ is defined
through the market clearing condition of zero total
excess demand:
$$ S_t := \frac{1}{c_t} \sum_{a \in A} c_t^a \hat{S}_t^a + \eta_t $$
with $c_t := \sum_{a \in A} c_t^a$ and aggregate liquidity demand $\eta_t := \frac{1}{c_t} \sum_{a \in A} \eta_t^a$.
Temporary equilibrium prices are given as a weighted
average of individual price assessments and liquidity
demand.
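A minimal numerical check of the market clearing condition: setting total excess demand to zero and solving for the log price gives the weighted average of reference levels plus aggregate liquidity demand. The function names and the three-agent numbers below are hypothetical.

```python
import numpy as np

def excess_demand(log_p, c, s_hat, eta):
    """Individual excess demands c^a * (S_hat^a - log p) + eta^a."""
    c, s_hat, eta = (np.asarray(v, dtype=float) for v in (c, s_hat, eta))
    return c * (s_hat - log_p) + eta

def equilibrium_log_price(c, s_hat, eta):
    """Log price at which total excess demand is zero."""
    c, s_hat, eta = (np.asarray(v, dtype=float) for v in (c, s_hat, eta))
    return float((np.dot(c, s_hat) + eta.sum()) / c.sum())

# Three hypothetical agents: weights, reference levels, liquidity demands
c = [1.0, 2.0, 1.0]
s_hat = [0.30, 0.34, 0.36]
eta = [0.01, -0.02, 0.00]

S = equilibrium_log_price(c, s_hat, eta)
print("equilibrium log price:", S)
print("total excess demand at S:", excess_demand(S, c, s_hat, eta).sum())
```

The agent with the largest weight pulls the price towards his own reference level, while liquidity demand shifts the price without reference to any assessment.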
Choosing Individual Assessments
• The choice of the reference level is based on the
recommendations of some financial experts:
$$ \hat{S}_t^a \in \{ R_t^1, \ldots, R_t^m \} $$
• The fraction of agents following guru $i$ in period $t$ is
given by
$$ \pi_t^i := \frac{1}{c_t} \sum_{a \in A} c_t^a \, \mathbf{1}_{\{ \hat{S}_t^a = R_t^i \}} $$
• The logarithmic equilibrium price for period $t$ takes
the form
$$ S_t = \sum_{i=1}^m \pi_t^i R_t^i + \eta_t $$
Temporary equilibrium prices are given as a weighted
average of recommendations
and liquidity demand.
The Gurus' Recommendations
• The recommendation of guru $i \in \{1, \ldots, m\}$ is based on
a subjective assessment $F^i$ of some fundamental value
and a price trend:
$$ R_t^i := S_{t-1} + \alpha^i (F^i - S_{t-1}) + \beta^i (S_{t-1} - S_{t-2}) $$
• The dynamics of stock prices is governed by the
recursive relation
$$ S_t = F(S_{t-1}, S_{t-2}, \tau_t) = \bigl(1 - \alpha(\pi_t) + \beta(\pi_t)\bigr) S_{t-1} - \beta(\pi_t) S_{t-2} + \gamma(\pi_t, \eta_t) $$
in the random environment $\{\tau_t\} = \{(\pi_t, \eta_t)\}$.
• Unlike in physics, the environment will be generated
endogenously.
The dynamics of
stock prices is described by a linear
recursive equation in a random environment of investor
sentiment and liquidity demand.
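The step from the gurus' recommendations to the recursive relation is a substitution. The sketch below assumes that every agent follows exactly one guru, so that the fractions $\pi_t^i$ sum to one, and it reads the coefficients as averages of the gurus' parameters weighted by those fractions; the names $\alpha(\pi_t)$, $\beta(\pi_t)$ and $\gamma(\pi_t, \eta_t)$ are a reconstruction of the notation.

```latex
\begin{align*}
S_t &= \sum_{i=1}^m \pi_t^i R_t^i + \eta_t \\
    &= \sum_{i=1}^m \pi_t^i \left[ S_{t-1} + \alpha^i (F^i - S_{t-1})
       + \beta^i (S_{t-1} - S_{t-2}) \right] + \eta_t \\
    &= \bigl(1 - \alpha(\pi_t) + \beta(\pi_t)\bigr) S_{t-1}
       - \beta(\pi_t) S_{t-2} + \gamma(\pi_t, \eta_t),
\end{align*}
where
\[
\alpha(\pi_t) := \sum_{i=1}^m \alpha^i \pi_t^i, \qquad
\beta(\pi_t) := \sum_{i=1}^m \beta^i \pi_t^i, \qquad
\gamma(\pi_t, \eta_t) := \sum_{i=1}^m \alpha^i \pi_t^i F^i + \eta_t .
\]
```

The price equation is linear in past prices, but its coefficients change with the fractions of agents following each guru.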
• The recommendation of a fundamentalist conveys the
idea that prices move closer to the fundamental value:
$$ R_t^i := S_{t-1} + \alpha^i (F^i - S_{t-1}), \qquad \alpha^i \in (0,1) $$
• If only fundamentalists are active on the market
$$ S_t = \bigl(1 - \alpha(\pi_t)\bigr) S_{t-1} + \gamma(\pi_t, \eta_t), \qquad \alpha(\pi_t) = \sum_i \alpha^i \pi_t^i $$
and prices behave in a mean-reverting manner because
$\alpha^i \in (0,1)$.
• The sequence of temporary price equilibria may be
viewed as an Ornstein-Uhlenbeck process in a random
environment. Fundamentalists have a stabilizing effect
on the dynamics
of stock prices.
• A chartist bases his prediction of the future evolution
of stock prices on past observations:
$$ R_t^i := S_{t-1} + \beta^i (S_{t-1} - S_{t-2}), \qquad \beta^i \in (0,1) $$
• If only chartists are active in the market
$$ S_t = S_{t-1} + \beta(\pi_t) (S_{t-1} - S_{t-2}) + \eta_t, \qquad \beta(\pi_t) = \sum_i \beta^i \pi_t^i $$
• Returns behave in a mean-reverting manner, but prices
are highly transient. Chartists have a destabilizing effect
on the dynamics of stock prices.
The Interactive Effects of Chartists and
Fundamentalists
• If both chartists and fundamentalists are active
$$ S_t = \bigl(1 - \alpha(\pi_t) + \beta(\pi_t)\bigr) S_{t-1} - \beta(\pi_t) S_{t-2} + \gamma(\pi_t, \eta_t) $$
• Prices behave in a stable manner in periods where the
impact of chartists is weak enough.
• Prices behave in an unstable manner in periods where
the impact of chartists becomes too strong.
• Temporary bubbles and crashes occur, due to trend
chasing.
The overall behavior of the price process turns out to be
ergodic if, on average, the impact of chartists is not too
strong.
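A rough way to see how the share of chartists matters is to freeze the coefficients of the recursion and look at the companion matrix of the resulting second-order difference equation. The sketch assumes just two gurus, one pure fundamentalist and one pure chartist, with made-up parameters `alpha_f` and `beta_c`. It says nothing about the random environment itself, where the coefficients change from period to period.

```python
import numpy as np

def spectral_radius(chartist_share, alpha_f=0.5, beta_c=0.8):
    """Largest eigenvalue modulus of the frozen-coefficient recursion
    S_t = (1 - alpha + beta) S_{t-1} - beta S_{t-2} + const,
    with alpha = alpha_f * (1 - share) and beta = beta_c * share.
    """
    alpha = alpha_f * (1.0 - chartist_share)
    beta = beta_c * chartist_share
    companion = np.array([[1.0 - alpha + beta, -beta],
                          [1.0, 0.0]])
    return float(np.max(np.abs(np.linalg.eigvals(companion))))

for share in (0.0, 0.5, 0.9, 1.0):
    print(f"chartist share {share:.1f}: spectral radius {spectral_radius(share):.3f}")
```

With these parameters the radius stays below one as long as some fundamentalists are active and reaches one, a unit root in prices, when only chartists remain. This matches the remark that prices are highly transient in the pure chartist case.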
Performance Measures
How do the agents decide which guru to follow?
• The agents' propensity to follow an individual guru
depends on the guru's performance.
• We associate « virtual » profits with the gurus' trading:
$$ P_t^i := (R_{t-1}^i - S_{t-1}) \left( e^{S_t} - e^{S_{t-1}} \right) $$
• The performance of the guru $i$ in period $t$ is given by
$$ U_t^i := \delta U_{t-1}^i + P_t^i = \sum_{j=0}^{t} \delta^{t-j} P_j^i $$
i.e., by a discounted sum of past profits, with discount factor $\delta$.
The agents adopt the gurus' recommendations with
probabilities related to their current performance.
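The performance measure can be sketched in a few lines. The function names, the discount factor of 0.9 and the profit numbers are hypothetical. The rule by which agents turn performances into choice probabilities is left out because the slides do not specify it.

```python
import math

def virtual_profit(r_prev, s_prev, s_now):
    """Profit of a guru who recommended log price r_prev when the log
    price moved from s_prev to s_now: (R - S_prev) * (e^S_now - e^S_prev)."""
    return (r_prev - s_prev) * (math.exp(s_now) - math.exp(s_prev))

def update_performance(u_prev, profit, discount):
    """One step of the recursion U_t = discount * U_{t-1} + P_t."""
    return discount * u_prev + profit

# A guru who expected a rise gains when the price does rise
print(virtual_profit(r_prev=0.35, s_prev=0.30, s_now=0.32) > 0)

# The recursion reproduces the discounted sum of past profits
profits = [1.0, -0.5, 2.0]
discount = 0.9
u = 0.0
for p in profits:
    u = update_performance(u, p, discount)
direct = sum(discount ** (len(profits) - 1 - j) * p for j, p in enumerate(profits))
print(u, direct)
```

A guru earns a positive virtual profit whenever the price moves in the direction his recommendation implied, whatever the reasoning behind that recommendation.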
Performance Measures
• Propensities to follow individual gurus depend on
performances:
$$ \pi_{t+1} \sim Q(U_t; \cdot), \qquad \text{where } U_t = (U_t^1, \ldots, U_t^m) $$
• The better a guru's performance, the more likely the
agents follow his recommendations.
• The more agents follow a guru's recommendation, the
stronger his impact on the dynamics of stock prices.
• The stronger a guru's impact on the dynamics of stock
prices, the better his performance.
The dependence of individual choices on performances
generates a
self-reinforcing incentive to follow the currently most
successful guru.
Performance Measures and Feedback Effects
• The dynamics of logarithmic stock prices are described
by a linear stochastic difference equation
$$ S_t = \bigl(1 - \alpha(\pi_t) + \beta(\pi_t)\bigr) S_{t-1} - \beta(\pi_t) S_{t-2} + \gamma(\pi_t, \eta_t) $$
in a random environment $\{(\pi_t, \eta_t)\}$.
• Aggregate liquidity demand is modelled by an
exogenous process.
• The dynamics of $\{\pi_t\}$ is generated in an endogenous
manner.
• The distribution of $\pi_t$ depends on all the prices up to
time $t-1$.
The dependence of individual choices on performances
generates a feedback from past prices into the random
environment.
The Associated Markov Chain
• Aggregate liquidity demand follows an iid dynamics.
• Stock prices are given by the first component of the
Markov chain
$$ \xi_t = (S_t, S_{t-1}, U_t) $$
• The dynamics of the process $\{\xi_t\}$ can be described by
$$ \xi_{t+1} = V(\xi_t, \tau_t) := \begin{pmatrix} F(S_t, S_{t-1}, \tau_t) \\ S_t \\ \delta U_t + P(S_t, S_{t-1}, \tau_t) \end{pmatrix}, \qquad \tau_t \sim Z(U_t; \cdot), $$
where $\delta$ is the discount factor of the performance measure.
• The map $(S_t, S_{t-1}) \mapsto P(S_t, S_{t-1}, \tau_t)$ is non-linear.
The dynamics of the price-performance process $\{\xi_t\}$
can be described by an iterated function system, but
standard methods do not apply.
Stopping the process from exploding
• Bound the probability that an individual can
become a chartist
• If we do not do this the process may simply
explode
• We do not put arbitrary limits on the prices
that can be attained however
Nice Story! But…
The Real Problem
• We have a market clearing equilibrium but this is
not the way these markets function
• They function on the basis of an order book and
that is what we should model.
• Each price in very high frequency data
corresponds to an individual transaction
• The mechanics of the order book will influence
the structure of the time series
• How often do our agents revise their prices?
• They infer information from the actions of others
revealed by the transactions
How to solve this?
• This is the subject of a project with Ulrich Horst
• We will model an arrival process for orders and
the distribution from which these orders are drawn
will be determined by the movements of prices
• In this way we model directly what is too often
referred to as « microstructure noise » and remove
one of the problems with using high frequency
data
A Challenge
« In deep and liquid markets, market microstructure noise should pose
less of a concern for volatility estimation. It should be possible to
sample returns on such assets more frequently than returns on
individual stocks, before estimates of integrated volatility encounter
significant bias caused by the market microstructure features. It is
possible to sample the FX data as often as once every 15 to 20 seconds
without the standard estimator of integrated volatility showing
discernible effects stemming from market microstructure noise. This
interval is shorter than the sampling intervals of several minutes,
usually five or more minutes, often recommended in the empirical
literature.
This shorter sampling interval and associated larger sample size
affords a considerable gain in estimation precision. In very deep and
liquid markets, microstructure-induced frictions may be much less of
an issue for volatility estimation than was previously thought. »
Chaboud et al. (2007)
Our job is to explain why this is
The Curse of Dimensionality
The colorful phrase the ‘curse of dimensionality’ was apparently coined by Richard Bellman
in [3], in connection with the difficulty of optimization by exhaustive enumeration on
product spaces. Bellman reminded us that, if we consider a cartesian grid of spacing 1/10
on the unit cube in 10 dimensions, we have $10^{10}$ points; if the cube in 20 dimensions was
considered, we would have of course $10^{20}$ points. His interpretation: if our goal is to optimize
a function over a continuous product domain of a few dozen variables by exhaustively
searching a discrete search space defined by a crude discretization, we could easily be faced
with the problem of making tens of trillions of evaluations of the function. Bellman argued
that this curse precluded, under almost any computational scheme then foreseeable, the
use of exhaustive enumeration strategies, and argued in favor of his method of dynamic
programming.
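The arithmetic in this passage is easy to reproduce. The sketch below counts grid points the way the quotation does, ten per axis, and converts the count into computing time under an assumed rate of one billion function evaluations per second; that rate is an illustrative assumption and does not come from the quotation.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def grid_points(dim, points_per_axis=10):
    """Number of points on a cartesian grid with `points_per_axis`
    points along each of `dim` axes."""
    return points_per_axis ** dim

def years_to_enumerate(dim, evals_per_second=1e9):
    """Time to evaluate a function once at every grid point,
    at an assumed evaluation rate."""
    return grid_points(dim) / evals_per_second / SECONDS_PER_YEAR

for dim in (5, 10, 20):
    print(dim, grid_points(dim), f"{years_to_enumerate(dim):.3g} years")
```

Each additional dimension multiplies the work by ten, so a search that takes seconds in 10 dimensions takes thousands of years in 20 at the same rate.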
Why does this matter?
• We collect more and more data on individuals and,
in particular, on consumers and the unemployed
• If we have D observations on N individuals the
relationship between D and N is important if we
wish to estimate some functional relation between
the variables
• There is now a whole battery of approaches for
reducing the dimensionality of the problem and
these represent a major challenge for econometrics
A blessing?
• Mathematicians assert that such high
dimensionality leads to a « concentration of
measure »
• Someone here can no doubt explain how
this might help economists!
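One simple face of the concentration of measure can be seen numerically. In the sketch below the points are independent standard Gaussian vectors, an assumption chosen for convenience, and `relative_spread` is a hypothetical helper name. It measures how much the lengths of such vectors vary relative to their average length.

```python
import numpy as np

rng = np.random.default_rng(2)

def relative_spread(dim, n_points=2000):
    """Standard deviation of the Euclidean norms of `n_points` standard
    Gaussian vectors in `dim` dimensions, divided by their mean norm."""
    x = rng.standard_normal((n_points, dim))
    norms = np.linalg.norm(x, axis=1)
    return float(norms.std() / norms.mean())

spreads = {dim: relative_spread(dim) for dim in (2, 10, 100, 1000)}
for dim, s in spreads.items():
    print(f"dimension {dim:>4}: relative spread of norms = {s:.3f}")
```

As the dimension grows, the norms cluster ever more tightly around a common value, so a quantity that depends on many coordinates becomes nearly deterministic.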