Dealing with large datasets (by throwing away most of the data)
Alan Heavens, Institute for Astronomy, University of Edinburgh
with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor Whittley
Data-intensive Research Workshop, NeSC, 15 March 2010

Data
Data: a list of quantitative measurements.
Write the data as a vector x.
The data have errors (referred to as noise) n.

Modelling
In this talk I will assume there is a good model for the data, and for the noise.
Model: a theoretical framework, typically with parameters in it.
Forward modelling: given a model and values for its parameters, we calculate what the expected value of the data is: μ = ⟨x⟩.
Key quantity: the noise covariance matrix, C_ij = ⟨n_i n_j⟩.

Examples
Galaxy spectra: the flux measurements are the sum of the starlight from stars of a given age.
In the simplest case there are 2 parameters: age and mass.

Example: cosmology
Model: the Big Bang theory.
Parameters: expansion rate, density of ordinary matter, density of dark matter, dark energy content... (around 15 parameters).

Inverse problem
Parameter estimation: given some data and a model, what are the most likely values of the parameters, and what are their errors?
Model selection: given some data, which is the most likely model? (Big Bang vs Steady State, or a braneworld model.)

Best-fitting parameters
Usually we minimise a penalty function. For Gaussian errors, and no prior prejudice about the parameters, minimise χ²:
    χ² = Σ_i (x_i − μ_i)² / σ_i²                      (independent data)
    χ² = Σ_{i,j} (x_i − μ_i) (C⁻¹)_ij (x_j − μ_j)     (correlated data), where C_ij = ⟨n_i n_j⟩
Brute-force minimisation may be slow: the dataset size N may be large (N, N² or N³ scaling), or the parameter space may be large (exponential dependence).

Dealing with large parameter spaces
Don't explore it all: generate a chain of random points in parameter space.
The most common technique is MCMC (Markov Chain Monte Carlo).
Asymptotically, the density of points is proportional to the probability of the parameters.
This generates a posterior distribution: prob(parameters | data). (Most astronomical analysis is Bayesian.)
Variants: Hamiltonian Monte Carlo, Nested Sampling, Gibbs Sampling...

Large data sets
What scope is there to reduce the size of the dataset? Why? Faster analysis.
Can we do this without losing accuracy? Depending on where the information is coming from, often the answer is yes.

Karhunen-Loeve compression
If the information comes from the scatter about the mean, and the data have variable noise or are correlated, we can perform linear compression: y = B x.
Construct the matrix B so that the elements of y are uncorrelated, and ordered in increasing uselessness; the inverse variance on a parameter then accumulates mode by mode: σ⁻² = Σ_k σ_k⁻².
Limited compression is usually possible (Vogeley & Szalay 1995; Tegmark, Taylor & Heavens 1997). Quadratic compression is more effective.

Fisher Matrix
How is B determined? The key concept is the Fisher matrix, which gives the expected parameter errors.
Steps (a numerical sketch follows this list):
1. Diagonalise the covariance matrix C of the data.
2. Divide each mode by its r.m.s. (so now C = I).
3. Rotate again until each mode gives uncorrelated information on a parameter; this is the generalised eigenvalue problem (∂C/∂p_i) b = λ C b.
4. Order the modes, and throw the worst ones away.
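To make step 3 concrete, here is a minimal numpy/scipy sketch of the single-parameter Karhunen-Loeve construction described above. It assumes zero-mean Gaussian data whose parameter dependence sits in the covariance; the names kl_compression_matrix, dCdp and n_keep are illustrative, not from the talk.

```python
import numpy as np
from scipy.linalg import eigh

def kl_compression_matrix(C, dCdp, n_keep):
    """Karhunen-Loeve compression for a single parameter p (sketch).

    C      : (N, N) data covariance matrix
    dCdp   : (N, N) derivative of C with respect to the parameter p
    n_keep : number of modes to keep

    Solves the generalised eigenvalue problem (dC/dp) b = lambda C b.
    eigh normalises the eigenvectors so that b_i^T C b_j = delta_ij,
    which is the 'divide each mode by its r.m.s.' (whitening) step.
    """
    evals, evecs = eigh(dCdp, C)             # generalised symmetric eigenproblem
    order = np.argsort(np.abs(evals))[::-1]  # most informative modes first
    B = evecs[:, order[:n_keep]].T           # rows of B are compression vectors
    return B

# Usage: y = B @ x gives n_keep uncorrelated modes, ordered by how much
# information each carries about p; the remaining modes are thrown away.
```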
Massive Data Compression
If the information comes from the mean, rather than the scatter, much more radical surgery is possible.
A dataset of size N can be reduced to size M (= the number of parameters), sometimes without loss of accuracy.
This can be a massive reduction: e.g. a galaxy spectrum, 2000 → 10; the microwave background power spectrum, 2000 → 15.
Likelihood calculations are then at least N/M times faster.

MOPED* algorithm (patented)
Consider a weight vector y_1 = b_1 · x.
Choose b_1 such that the likelihood of y_1 is as sharply peaked as possible in the direction of parameter 1.
Repeat (subject to some constraints) for all M parameters.
The dataset is reduced to size M (the scaling is independent of N).
* Massively-Optimised Parameter Estimation and Data compression

MOPED weighting vectors
MOPED automatically calculates the optimum weights for each data point (a numerical sketch is given at the end of these notes).
In many cases, the errors from the compressed dataset are no larger than those from the entire dataset. It is NOT obvious that this is possible.
Example: a set of data points from which you want to estimate the mean (of the population from which the sample is drawn). If all the errors are the same, then b = (1/N, 1/N, ...), i.e. average the data.

Examples
Galaxy spectra: ~100,000 galaxies; the data are compressed by 99%; analysis time reduced from 400 years to a few weeks.
[Figure: MOPED weighting vectors.]

Medical imaging: registration
[Figures: stroke lesion; image distortions.]
MRI scans: 512 × 512 × 100 voxels, so N = 2.6 × 10⁷.
Affine distortions: M = 12.

Summary
Astronomical datasets can be large, but the set of interesting quantities may be small.
With a good model for the data, carefully designed (and massive) data compression can hugely speed up analysis, with no loss of accuracy.
Such a situation is quite typical, so there are applications elsewhere - see the Blackford Analysis stand in the Research Village.
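To make the MOPED weighting above concrete, here is a minimal numpy sketch following the published form of the weights (Heavens, Jimenez & Lahav 2000), assuming Gaussian noise with a covariance that does not depend on the parameters; moped_vectors, Cinv and dmu are illustrative names, not part of the patented implementation.

```python
import numpy as np

def moped_vectors(Cinv, dmu):
    """MOPED-style weighting vectors (sketch).

    Cinv : (N, N) inverse noise covariance matrix
    dmu  : (M, N) derivatives of the mean model, dmu[m] = d mu / d p_m

    Returns B with shape (M, N): one weight vector per parameter.
    b_1 is proportional to C^-1 dmu_1 (the direction in which the likelihood
    of y_1 is most sharply peaked for parameter 1); later vectors are
    Gram-Schmidt orthogonalised against the earlier ones, so the M
    compressed numbers y = B x are uncorrelated.
    """
    M, N = dmu.shape
    B = np.zeros((M, N))
    for m in range(M):
        v = Cinv @ dmu[m]
        norm = dmu[m] @ Cinv @ dmu[m]
        for q in range(m):
            overlap = dmu[m] @ B[q]
            v -= overlap * B[q]      # remove what earlier vectors already capture
            norm -= overlap ** 2
        B[m] = v / np.sqrt(norm)
    return B

# Usage: y = B @ x compresses N data values to M numbers (e.g. 2000 -> 10
# for a galaxy spectrum); the likelihood is then evaluated for y instead of x,
# which is where the N/M speed-up quoted above comes from.
```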