Dealing with large datasets (by throwing away most of the data)

Alan Heavens
Institute for Astronomy, University of Edinburgh
with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor Whittley
Data-intensive Research Workshop, NeSC Mar 15 2010
Tuesday, 16 March 2010
Data: a list of measurements
Write data as a vector x
Data have errors (referred to
as noise) n
In this talk, I will assume there is a good model for the
data, and the noise
Model: a theoretical framework
Model typically has parameters in it
Forward modelling: given a model, and values for
the parameters, we calculate the expected value
of the data: $\mu = \langle x \rangle$
Key quantity: noise covariance matrix: $C_{ij} = \langle n_i n_j \rangle$
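A minimal sketch of forward modelling and the noise covariance, assuming a toy straight-line model. The function `forward_model`, the parameter values and the noise levels are illustrative choices, not taken from the talk.

```python
import numpy as np

def forward_model(theta, t):
    """Hypothetical toy model (a straight line): returns mu = <x>."""
    intercept, slope = theta
    return intercept + slope * t

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)            # where the data are taken
mu = forward_model((1.0, 2.0), t)        # expected data for assumed parameters

sigma = 0.1 + 0.2 * t                    # assumed r.m.s. noise, varying per point
C = np.diag(sigma**2)                    # C_ij = <n_i n_j>, uncorrelated here

# Average n_i n_j over many noise realisations to check it approaches C.
noise = rng.normal(0.0, sigma, size=(20000, t.size))
C_est = noise.T @ noise / noise.shape[0]

x = mu + noise[0]                        # one simulated data vector
print(np.abs(C_est - C).max())           # small compared with the entries of C
```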
Galaxy spectra: flux measurements are the sum total of
starlight from stars of a given age (simplest case)
2 parameters = age, mass
Example: cosmology
Model: Big Bang theory
Parameters: Expansion rate, density of ordinary matter,
density of dark matter, dark energy content... (around
15 parameters)
Inverse problem
Parameter Estimation: given some data, and a
model, what are the most likely values of the
parameters, and what are their errors?
Model Selection: given some data, which is the most
likely model? (Big Bang vs Steady State, or
Braneworld model)
Best-fitting parameters
Usually minimise a penalty function. For Gaussian errors, and no prior prejudice about the parameters, minimise $\chi^2$:
Uncorrelated data, with variances $\sigma_i^2 = C_{ii}$:
$\chi^2 = \sum_{\mathrm{data}\ i} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$
Correlated data, with $C_{ij} = \langle n_i n_j \rangle$:
$\chi^2 = \sum_{\mathrm{data}\ i,j} (x_i - \mu_i)\, C^{-1}_{ij}\, (x_j - \mu_j)$
Brute-force minimisation may be slow: dataset size N may be large ($N$, $N^2$ or $N^3$ scaling), or parameter space may be large (exponential dependence)
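The $\chi^2$ for correlated Gaussian noise is the quadratic form $(x - \mu)^T C^{-1} (x - \mu)$. A short sketch of how it might be evaluated follows. The function names are illustrative. The dense solve is where the $N^3$ cost comes from, while the uncorrelated case is a simple sum.

```python
import numpy as np

def chi2(x, mu, C):
    """chi^2 = (x - mu)^T C^{-1} (x - mu) for noise covariance C."""
    r = x - mu
    # Solve C z = r rather than forming C^{-1}; costs O(N^3) for dense C.
    return float(r @ np.linalg.solve(C, r))

def chi2_uncorrelated(x, mu, sigma):
    """Special case C = diag(sigma^2): an O(N) sum."""
    return float(np.sum(((x - mu) / sigma) ** 2))

rng = np.random.default_rng(0)
N = 200
mu = np.zeros(N)
sigma = rng.uniform(0.5, 2.0, N)         # assumed per-point noise levels
x = mu + rng.normal(0.0, sigma)

# The two forms agree when C is diagonal.
print(chi2(x, mu, np.diag(sigma**2)), chi2_uncorrelated(x, mu, sigma))
```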
Dealing with large parameter spaces
Don’t explore it all - generate a chain of
random points in parameter space
Most common technique is MCMC
(Markov Chain Monte Carlo)
Asymptotically, the density of points is
proportional to the probability of the parameter values
Generate a posterior distribution:
prob(parameters | data)
(Most astronomical analysis is
Variants: Hamiltonian Monte Carlo, Nested Sampling
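A minimal Metropolis sampler, the simplest form of MCMC, can illustrate the idea. This is a sketch under stated assumptions: a toy straight-line model, a flat prior, equal Gaussian errors, and a hand-picked step size. The names `log_posterior` and `metropolis` are illustrative.

```python
import numpy as np

def log_posterior(theta, x, t, sigma):
    """Flat prior and Gaussian errors: log prob = -chi^2 / 2 + const."""
    mu = theta[0] + theta[1] * t          # toy forward model (straight line)
    return -0.5 * np.sum(((x - mu) / sigma) ** 2)

def metropolis(log_prob, theta0, step, n_steps, rng):
    """Random-walk Metropolis: returns the chain of visited points."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_prob(theta)
    chain = np.empty((n_steps, theta.size))
    for k in range(n_steps):
        proposal = theta + step * rng.normal(size=theta.size)
        lp_new = log_prob(proposal)
        # Accept with probability min(1, p_new / p_old).
        if np.log(rng.uniform()) < lp_new - lp:
            theta, lp = proposal, lp_new
        chain[k] = theta
    return chain

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
sigma = 0.1
x = 1.0 + 2.0 * t + rng.normal(0.0, sigma, t.size)   # simulated data

chain = metropolis(lambda th: log_posterior(th, x, t, sigma),
                   [0.0, 0.0], step=0.02, n_steps=30000, rng=rng)
samples = chain[5000:]                    # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))
```

The density of `samples` approximates prob(parameters | data), so means and standard deviations of the chain give parameter estimates and errors.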
Large data sets
What scope is there to reduce the size of the dataset?
Why? Faster analysis
Can we do this without losing accuracy?
Depending on where the information is coming from,
often the answer is yes.
If information comes from the scatter about the mean, then if the data have variable noise, or are correlated, we can perform linear compression: $y = Bx$
Construct matrix B so that elements of y are uncorrelated, and ordered from most to least informative
[Figure with axis label: modes k]
Limited compression usually possible
Vogeley & Szalay 1995, Tegmark, Taylor & Heavens 1997
Quadratic compression more effective
Fisher Matrix
How is B determined? Key concept is
the Fisher Matrix - gives the expected
parameter errors
Diagonalise the covariance matrix C of the data
Divide each mode by its r.m.s. (so now C = I)
Rotate again until each mode gives uncorrelated
information on a parameter $\theta$ (generalised
eigenvalue problem):
$\frac{\partial C}{\partial \theta}\, b = \lambda C b$
Order the modes, and throw the worst ones away
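The steps above can be sketched in NumPy for a single parameter. The toy covariance $C(\theta) = \theta S + \mathrm{noise}$, the shape of `S`, and the choice to keep 10 modes are assumptions made for illustration. The information carried by each mode is taken as $\lambda^2/2$, which is the Gaussian Fisher information when the mean does not depend on $\theta$.

```python
import numpy as np

N = 60
idx = np.arange(N)
# Assumed toy model: zero-mean data with covariance C(theta) = theta * S + noise.
S = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / 5.0**2)  # signal shape
noise_cov = np.diag(np.linspace(0.5, 2.0, N))                    # variable noise
theta = 1.0
C = theta * S + noise_cov
dC = S                                    # dC/dtheta

# Step 1: diagonalise C.
w, U = np.linalg.eigh(C)
# Step 2: divide each mode by its r.m.s., so the covariance becomes I.
W = (U / np.sqrt(w)).T
# Step 3: rotate again by diagonalising the whitened derivative.
lam, V = np.linalg.eigh(W @ dC @ W.T)
# Step 4: order the modes by |lambda|, most informative first.
order = np.argsort(-np.abs(lam))
lam, V = lam[order], V[:, order]
B = V.T @ W                               # rows solve dC b = lambda C b

info = 0.5 * lam**2                       # Fisher information per mode
keep = 10
frac = info[:keep].sum() / info.sum()
print(f"{keep}/{N} modes keep {frac:.1%} of the information on theta")
```

The compressed data are `y = B[:keep] @ x`. How many modes to keep is a judgment call based on `info`.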
Massive Data Compression
If the information comes from the mean, rather than the
scatter, much more radical surgery is possible
Dataset of size N can be reduced to size M (= number
of parameters), sometimes without loss of accuracy
This can be a massive reduction: e.g.
Galaxy spectrum 2000 → 10
Microwave background power spectrum 2000 → 15
Likelihood calculations are at least N/M times faster
MOPED* algorithm
Consider a weighted sum of the data, $y_1 = b_1 \cdot x$, with weight vector $b_1$
Choose b1 such that the likelihood of
y1 is as sharply peaked as possible, in
the direction of parameter 1
Repeat (subject to some constraints)
for all M parameters
Dataset reduced to size M
(independent of N) - scaling
* Massively-Optimised Parameter Estimation and Data compression
MOPED weighting vectors
MOPED automatically calculates the optimum weights
for each data point
In many cases, the errors from the compressed dataset
are no larger than those from the entire dataset
It is NOT obvious that this is possible
Example: set of data points, from which you want to
estimate the mean (of the population from which the
sample is drawn). If all errors are the same, then b =
(1/N, 1/N, ...) i.e. average the data.
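A sketch of how such weighting vectors might be computed, assuming Gaussian noise with a covariance C that does not depend on the parameters. The function name `moped_vectors` and the toy models are illustrative. The vectors are normalised so that the compressed numbers have unit variance, so in the equal-error example the weights equal (1/N, 1/N, ...) only up to an overall factor.

```python
import numpy as np

def moped_vectors(dmu, C):
    """Rows b_m such that y_m = b_m . x; dmu has shape (M, N) and holds
    the derivative of the mean with respect to each parameter."""
    Cinv_dmu = np.linalg.solve(C, dmu.T).T        # row m is C^{-1} dmu_m
    B = []
    for m in range(dmu.shape[0]):
        v = Cinv_dmu[m].copy()
        norm2 = dmu[m] @ Cinv_dmu[m]
        for b in B:                                # remove what earlier y's hold
            overlap = dmu[m] @ b
            v -= overlap * b
            norm2 -= overlap**2
        B.append(v / np.sqrt(norm2))
    return np.array(B)

# Example from the slide: estimate a mean from N points with equal errors.
N, sigma = 8, 0.3
b = moped_vectors(np.ones((1, N)), sigma**2 * np.eye(N))[0]
print(b / b.sum())                                 # equal weights, 1/N each

# Two parameters (intercept, slope) with unequal errors: 100 numbers -> 2.
t = np.linspace(0.0, 1.0, 100)
dmu = np.vstack([np.ones_like(t), t])
C = np.diag((0.1 + 0.5 * t) ** 2)
B = moped_vectors(dmu, C)
F_full = dmu @ np.linalg.solve(C, dmu.T)           # Fisher matrix, all data
F_comp = (B @ dmu.T).T @ (B @ dmu.T)               # Fisher matrix, compressed
print(np.allclose(F_full, F_comp))
```

In this linear toy case the Fisher matrix of the two compressed numbers matches that of the full dataset, which is the sense in which the errors are no larger after compression.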
Galaxy Spectra
~100,000 galaxies
Data compressed by 99%
Analysis time reduced from
400 years to a few weeks
MOPED weighting vectors
Medical imaging: registration
Stroke lesion
Image Distortions
MRI scans:
512x512x100 voxels
$N = 2.6 \times 10^7$
Affine distortions: M = 12
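The sizes quoted above can be checked with a few lines of arithmetic. The count of affine parameters assumes the usual 3D form, a 3x3 matrix plus a translation vector.

```python
# Image size from the slide: 512 x 512 x 100 voxels.
N = 512 * 512 * 100
# A 3D affine map x -> A x + t has a 3x3 matrix A plus a 3-vector t.
M = 3 * 3 + 3
print(N, M, N // M)   # 26214400 12 2184533
```

Compressing N voxel values to M numbers is therefore a reduction by a factor of about two million.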
Astronomical datasets can be large, but the set of
interesting quantities may be small
With a good model for the data, carefully-designed
(and massive) data compression can hugely speed up
analysis, often with no loss of accuracy
Such a situation is quite typical - applications
elsewhere - Blackford Analysis stand in the Research