High-Dimensional Inference with Applications
University of Kent, 24/25 June 2013
Keynote speakers/abstracts

1. Bayesian Inference in Finite and Infinite Dimensions
Professor Philip Dawid, Cambridge

When the parameter space is finite-dimensional, all reasonable prior distributions are mutually absolutely continuous: this implies that their associated posterior or predictive distributions will become indistinguishable as more data accrue, leading to essentially “objective” Bayesian inference in large samples. By contrast, in nonparametric problems, involving infinite-dimensional parameter spaces, distinct prior distributions are generically mutually singular, and this implies that no amount of data can bring their inferences into agreement. I will discuss some of the issues associated with this, with special attention to the need for caution when specifying prior distributions on infinite-dimensional spaces.

2. BS in Britain: Mitigating the effects of preferentially selected monitoring sites for inference and policy
Professor Jim Zidek, University of British Columbia, Canada
Co-author: Dr Gavin Shaddick, Bath

In the 1960s, over 2000 sites in the UK monitored black smoke (BS) air pollution because of concerns about its effect on public health, an effect demonstrated all too clearly by the famous London fog of 1952. Abatement measures led to a decline in the levels of BS and hence a reduction in the number of monitoring sites to fewer than 200 by 1996. Treating the BS example as a case study, the speaker will argue that the sites chosen for removal were preferentially selected, causing estimates of the metrics used by regulatory agencies to be too high. Moreover, he will describe an approach to mitigating the effects of preferential sampling. The large number of monitoring sites and their associated high-dimensional data vectors rule out naïve use of classical geostatistical methods for this purpose, hence the need for the novel analytical approaches that will be described. The work has important general implications for the setting of regulatory standards and the design of monitoring networks. Most importantly, it points anew to the importance of good design in statistical measurement and testing.

3. Bayesian Models for Integrative Genomics
Professor Marina Vannucci, Rice University, Houston, Texas

Novel methodological questions are now being generated in bioinformatics and require the integration of different concepts, methods, tools and data types. Bayesian methods that employ variable selection have been particularly successful for genomic applications, as they can handle situations where the number of measured variables is much greater than the number of observations. In this talk I will first describe Bayesian variable selection methods for linear settings that incorporate external biological information into the analysis of gene expression data. I will then focus on models that achieve an even greater degree of integration, by incorporating into the modelling experimental data from different platforms, together with prior knowledge. I will look in particular at graphical models, integrating gene expression data with microRNA expression data, and at a hierarchical mixture model for imaging genetics data, incorporating functional MRI data and genetic information measured on the same set of patients. All modelling settings employ variable selection techniques and prior constructions that cleverly incorporate biological knowledge about structural dependencies among the variables.
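As a schematic illustration of the kind of Bayesian variable selection referred to in abstract 3 (and revisited in abstracts 8 and 11), the following sketch enumerates all $2^p$ submodels of a small synthetic linear regression under a Zellner $g$-prior and reports posterior inclusion probabilities. The data, the $g$-prior, the uniform model prior and the exhaustive enumeration are illustrative assumptions only, not the biologically informed models of the talk.

```python
# Toy Bayesian variable selection for a linear model (illustrative only).
# Uses Zellner's g-prior so each model's marginal likelihood is closed-form,
# and enumerates all 2^p models, which is feasible only for small p.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0])   # made-up truth
y = X @ beta_true + rng.standard_normal(n)
y = y - y.mean()                      # centre the response; no intercept below

g = float(n)                          # unit-information g-prior
yty = y @ y

def log_marginal(subset):
    """log p(y | model), up to a constant shared by all models."""
    k = len(subset)
    if k == 0:
        return -0.5 * n * np.log(yty)
    Xg = X[:, subset]
    coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    yHy = y @ (Xg @ coef)             # y' H y with H the submodel hat matrix
    return -0.5 * k * np.log(1 + g) - 0.5 * n * np.log(yty - g / (1 + g) * yHy)

# Uniform prior over the 2^p models => posterior proportional to the marginal.
models = [s for r in range(p + 1) for s in itertools.combinations(range(p), r)]
logm = np.array([log_marginal(s) for s in models])
post = np.exp(logm - logm.max())
post /= post.sum()

# Posterior inclusion probability for each variable.
incl = np.zeros(p)
for prob, s in zip(post, models):
    incl[list(s)] += prob
print("posterior inclusion probabilities:", np.round(incl, 3))
print("top model:", models[int(np.argmax(post))])
```

For genomic-scale $p$ the enumeration above is impossible, which is what motivates the stochastic-search and deterministic alternatives discussed in abstracts 8 and 11.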
4. MaTaDOR: Bayesian Object Regression for Complex, High Dimensional Data
Professor Jeffrey S. Morris, University of Texas MD Anderson Cancer Center

The term “object data” generalizes “functional data”: it can be defined as multiple measurements on some type of structured space, and includes, for example, functions, images, shapes, graphs, and trees. The internal structure of the objects can be based on geometry or on more complex scientific relationships, and efficient statistical methods should take this internal structure into account. In this talk, I will discuss MaTaDOR (MulTi-Domain Object Regression), a very general and flexible modelling framework, generalizing functional mixed models, that can be used to perform unified Bayesian regression analyses on a broad array of such object data while flexibly taking various types of internal structure into account. Our strategy uses various types of basis functions to capture the objects’ internal structure, within a modelling approach that is conducive to parallel processing and scales to very large data sets. I will discuss some specific methods developed within this framework (some in collaboration with Phil Brown), including robust object regression using scale mixture distributions, object classification using predictive probabilities, and nonparametric additive models for object data.

5. Making Bayesian Mixture Models Identifiable
Professor Stephen Walker, SMSAS, Kent

We are interested in making the Bayesian mixture model identifiable. It is known that the model with weights and locations is not identifiable in these parameters. Hence, the latent allocation variables are also difficult to interpret, making clustering problematic. In this talk we endeavour to make the model identifiable by marginalizing over the weights and locations. This leaves a model for the observations given the allocations, and hence a prior for the allocations is needed. We propose a form of prior which provides an explicit interpretation for the allocations. Supporting theory and illustrations are presented.

6. Election-night Forecasting for the BBC: Statisticians versus Presenters and Swingometers
Dr Clive Payne, Oxford

A review of the statistical methods used in BBC election-night forecasts over the last 40 years, with particular emphasis on the Brown-Payne ridge regression method used in the General Elections of 1974-2002 and on the exit-poll-based methods used more recently. The paper will include discussion of the media aspects of the presentation of predictions.

7. A statistical framework for comparison of climate simulation models with past climate observations, including a calibration problem
Professor Rolf Sundberg, Stockholm

The variability in a climate simulation model and in actual observational data have in common only the possible responses to so-called forcings built into the model. Forcings are more or less well-documented factors thought to have affected the climate, such as variation in planet orbit, solar strength, land use, and greenhouse gases. Climate observations are of instrumental (relatively recent) or proxy type. The latter are surrogates for instrumental measurements, based on, for example, tree rings or various kinds of sediments. We will formulate a statistical framework aimed at comparing climate models with different built-in forcing effects with respect to their ability to fit past climate data. The framework will require a discussion of the calibration of proxies.
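As a schematic illustration of the proxy-calibration step mentioned in abstract 7, the following sketch (with entirely synthetic numbers) regresses a proxy on temperature over an instrumental overlap period and inverts the fitted line to reconstruct temperature in the proxy-only era. This is only a toy version of classical calibration, not the statistical framework of the talk.

```python
# Toy "classical calibration" of a climate proxy (illustrative only).
# During an overlap period both the instrumental temperature and the proxy are
# observed; the proxy is regressed on temperature and the fitted line is then
# inverted to reconstruct temperature where only the proxy exists.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "truth": 300 years of temperature anomalies (made-up numbers)
years = np.arange(1700, 2000)
temp = (0.3 * np.sin(2 * np.pi * (years - 1700) / 70)
        + 0.002 * (years - 1700)
        + 0.1 * rng.standard_normal(years.size))

# Proxy (e.g. a tree-ring index) responds linearly to temperature, with noise
a_true, b_true, sigma = 1.0, 2.0, 0.15
proxy = a_true + b_true * temp + sigma * rng.standard_normal(years.size)

# The instrumental record only covers 1900 onwards: the calibration period
calib = years >= 1900
b, a = np.polyfit(temp[calib], proxy[calib], 1)     # proxy ~ a + b * temp

# Classical (inverse) calibration: invert the fitted line for the proxy-only era
temp_recon = (proxy[~calib] - a) / b
rmse = np.sqrt(np.mean((temp_recon - temp[~calib]) ** 2))
print(f"fitted a={a:.2f}, b={b:.2f}; reconstruction RMSE = {rmse:.3f}")
```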
8. An adaptive MCMC scheme for variable selection problems
Professor Jim Griffin
Co-authors: Dr Krzysztof Łatuszyński (Warwick) and Professor Mark Steel (Warwick)

Data sets with many variables are routinely collected in many disciplines. This has led to interest in variable selection in regression models with a large number of variables. A standard Bayesian approach defines a prior on the model space (defined by all subsets of the variables) and uses Markov chain Monte Carlo methods to explore that space. Unfortunately, the size of the space ($2^p$ if there are $p$ variables) and the use of simple proposals in Metropolis-Hastings steps have led to samplers that often mix poorly. In this talk, I will describe an adaptive Metropolis-Hastings scheme which adapts a wide class of proposals to the posterior distribution. This leads to orders-of-magnitude improvements in mixing over standard algorithms. The methods will be illustrated on several real regression problems.

9. A Benefit-Risk Analysis of using Formal Benefit-Risk Approaches for Decision-Making in Drug Regulation
Professor Deborah Ashby, School of Public Health, Imperial College London

For a medicine to gain a license it requires evidence of its efficacy and safety. Study designs and statistical methods are well developed to deal with the former, and to a lesser extent the latter. However, until recently, assessment of the benefit-risk balance of a medicine, especially in relation to alternatives, has been entirely informal. There is now growing interest among drug regulators and pharmaceutical companies in the possibilities of more formal approaches to benefit-risk decision-making. In this talk, we review the basis of drug regulation, the established statistical bases for decision-making under uncertainty, and current initiatives in the area. One such initiative forms part of the Pharmacoepidemiological Research on Outcomes of Therapeutics by a European Consortium (PROTECT) project, which is funded under the Innovative Medicines Initiative and is a collaboration between academic, pharmaceutical, regulatory and patient organizations. Based on work from this project we will review current methodological approaches and illustrate them with case studies on medicines where benefit-risk is finely balanced. The use of formal decision-making in this context is not uncontroversial, so we end with a somewhat informal and personal appraisal of the benefits and risks of taking this path.

10. Using decision theory to explore posterior models in cancer genomics
Professor Chris Holmes, Oxford

Bayesian models have proved highly useful in the analysis of genetic variation arising in cancers. We have previously developed Bayesian Hidden Markov Models for this task, in which the hidden states relate to structural changes in DNA known to be key drivers of cancer initiation and progression. In this talk we discuss the use of decision theory to help elicit knowledge contained in the posterior models. That is, having conditioned a model on data, how can we explore the posterior model for interesting, highly probable state sequences?
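As a schematic illustration of what extracting a "highly probable state sequence" from a posterior model can mean in the setting of abstract 10, the following sketch decodes a toy two-state copy-number hidden Markov model with the Viterbi algorithm, which returns the single most probable (MAP) state sequence, the Bayes-optimal summary under a 0-1 loss on whole sequences. The model, its parameters and the simulated data are all made up and are not those of the talk.

```python
# Toy HMM decoding (illustrative only): two hidden states for "normal" and
# "amplified" copy number, Gaussian emissions, Viterbi (MAP) state sequence.
import numpy as np

rng = np.random.default_rng(4)

trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])            # state transition probabilities
means, sd = np.array([0.0, 1.0]), 0.4       # emission: Normal(means[state], sd)
start = np.array([0.5, 0.5])

# Simulate a short sequence of probe-level measurements
T = 200
states = np.zeros(T, dtype=int)
for t in range(1, T):
    states[t] = rng.choice(2, p=trans[states[t - 1]])
obs = rng.normal(means[states], sd)

# Viterbi decoding in log space
log_emis = -0.5 * ((obs[:, None] - means) / sd) ** 2   # up to a constant
log_trans, log_start = np.log(trans), np.log(start)

delta = log_start + log_emis[0]
back = np.zeros((T, 2), dtype=int)
for t in range(1, T):
    scores = delta[:, None] + log_trans     # scores[i, j]: best path ending i -> j
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + log_emis[t]

path = np.zeros(T, dtype=int)
path[-1] = delta.argmax()
for t in range(T - 2, -1, -1):
    path[t] = back[t + 1, path[t + 1]]

print("fraction of probes decoded correctly:", (path == states).mean())
# A different loss function (e.g. per-position 0-1 loss) would instead call for
# marginal posterior decoding via the forward-backward algorithm.
```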
11. Variable Selection via EMVS
Professor Ed George, Wharton, Pennsylvania
Co-author: Veronika Rockova, Erasmus University, Rotterdam

Despite rapid developments in stochastic search algorithms, the practicality of Bayesian variable selection has continued to pose challenges. High-dimensional data are now routinely analyzed, typically with many more covariates than observations. To broaden the applicability of Bayesian variable selection to such contexts, we propose EMVS, a deterministic alternative to stochastic search based on an EM algorithm that quickly finds posterior modes over a nested sequence of continuous conjugate spike-and-slab priors. Such dynamic posterior exploration is summarized with a regularization diagram, and rigorous evaluation by posterior model probabilities is used to identify the most promising sparse submodels. External structural information, such as likely covariate groupings or network topologies, is easily incorporated into the EMVS framework. Deterministic annealing variants are seen to improve search effectiveness by mitigating posterior multimodality. Both univariate and multivariate regression examples will be used to illustrate EMVS in action.

12. Smooth supersaturated models: SSM
Professor Henry Wynn, London School of Economics

Polynomial models are at the centre of the emerging field of Algebraic Statistics, and the use of abstract algebra, particularly ideal theory, leads to greater understanding of classical issues such as aliasing. However, raw polynomial regression models have poor smoothness as soon as the degree exceeds two. With an elementary procedure one can produce high-degree polynomial models which have increased smoothness but for which $p$, the number of model terms, is greater than the sample size $n$. As $p$ gets larger one approaches spline kernels while, for finite $p$, retaining the property of being analytic. The algebraic methods aid the careful extension of the degree. Applications are to computer experiments, optimal design and sensitivity analysis, where being analytic has mathematical advantages. The methods are competitive with popular methods such as Gaussian kriging.
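As a rough numerical sketch of the supersaturated-interpolation idea in abstract 12 (not the algebraic construction of the talk), the following code fits a polynomial with more terms ($p = 12$) than observations ($n = 8$) by minimizing the integrated squared second derivative over $[-1, 1]$ subject to interpolating the data. The design, the monomial basis and the roughness criterion are illustrative assumptions.

```python
# Smooth supersaturated model sketch: a polynomial with more terms than data
# points, chosen to interpolate the data while minimizing its roughness
# (integrated squared second derivative) over [-1, 1].  Illustrative only.
import numpy as np
from numpy.polynomial import polynomial as P

def roughness_entry(i, j):
    """Integral over [-1, 1] of (x^i)'' (x^j)'' for the monomial basis."""
    di = P.polyder([0.0] * i + [1.0], 2)
    dj = P.polyder([0.0] * j + [1.0], 2)
    anti = P.polyint(P.polymul(di, dj))
    return P.polyval(1.0, anti) - P.polyval(-1.0, anti)

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 8)                        # n = 8 design points
y = np.sin(3 * x) + 0.05 * rng.standard_normal(x.size)

n, p = x.size, 12                                # p terms > n observations
V = np.vander(x, p, increasing=True)             # V[i, j] = x_i ** j
K = np.array([[roughness_entry(i, j) for j in range(p)] for i in range(p)])

# Minimize c' K c subject to V c = y: equality-constrained QP via KKT system
kkt = np.block([[2 * K, V.T], [V, np.zeros((n, n))]])
rhs = np.concatenate([np.zeros(p), y])
sol, *_ = np.linalg.lstsq(kkt, rhs, rcond=None)
c = sol[:p]                                      # smooth interpolating coefficients

print("max interpolation error:", np.abs(V @ c - y).max())
print("roughness c'Kc:", c @ K @ c)
```

The constant and linear terms carry no roughness, so the penalty matrix is only positive semi-definite; the interpolation constraints are what pin those components down in the sketch above.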