BIG DATA IN BIOMEDICINE. BIG MODELS? WORKSHOP PROGRAM Session 1 (9:30 - 11:30). Room LIB1, Library building 9:30 David Rossell (University of Warwick, UK) Welcome address 9:40 Keynote: Valen Johnson (Texas A&M University, USA). Title: Implications of uniformly most powerful Bayesian tests for the reproducibility of scientific research Abstract: Connections between uniformly most powerful Bayesian tests and classical uniformly most powerful tests are used to equate the rejection regions of these two types of tests in order obtain an approximate equivalence between Bayes factors and p-values. The implications of this equivalence for the reproducibility of scientific research are examined. Approximately uniformly most powerful Bayesian tests are described for t tests, and the power of these tests are compared to ideal Bayes factors (defined by determining the best test alternative for each true value of the parameter), as well as to Bayes factors obtained using the true parameter value as the alternative. Interpretations and asymptotic properties of these Bayes factors are also discussed. 10:20 Rich Savage (University of Warwick, UK) Title: The data deluge and what to do with it Abstract: Medicine has become a data-rich subject. This opens up huge possibilities for diagnostic/prognostic prediction as a tool for doctors to help patients. But to do this, we must develop ways to use effectively a hugely diverse range of data sources. I’ll discuss the current state-of-the-art in medical bioinformatics and take a look at some of the future challenges that we’re about to face. 10:50 Jun Liu (Harvard University, USA) Title: Bayesian inverse multivariate regression with application to eQTL analysis Abstract: Expression quantitative trait loci (eQTLs) are genomic locations associated with changes of expression levels of certain genes. By assaying gene expressions and genetic variations simultaneously on a genome-wide scale, scientists wish to discover genomic loci responsible for expression variations of a set of genes. The task can be viewed as a multivariate regression problem with variable selection on both responses (gene expression) and covariates (genetic variations), including also multi-way interactions among correlated covariates. Instead of learning a predictive model of quantitative trait given combinations of genetic markers, we adopt an inverse modeling perspective to model the distribution of genetic markers conditional on certain quantitative traits. A particular strength of our method is its ability to detect interactive effects of genetic variations with high power even when their marginal effects are weak, addressing a key weakness of many existing eQTL mapping methods. Furthermore, we illustrate how the methodology can be adapted to the general semi-parametric regression framework based on index modeling, and how to design a non-iterative step-wise algorithm to discover multi-way interactions among predictors in O(p) time, with p being the number of predictors in consideration. This is based on the joint work with Bo Jiang. 11:20 Floor discussion Break (11:30 - 11:50) Session 2 (11:50 - 13:00). Room LIB1, Library building 11:50 Mark Girolami (University of Warwick, UK) Title: Bayesian Model Selection: An Aid to the Cell Biologist? Abstract: Understanding the mechanisms of cell signal transduction holds the promise of new targeted cancer therapies as suggested by the success of specific Tyrosine Kinase Inhibitors. Formal models of biochemical kinetics describing hypothesised components of signalling pathway structures provide a means of assessing their evidential support by data assimilation. Bayesian model selection is a conceptually simple and appealing framework with which to evidentially assess competing hypothesised structures and this talk will discuss how it may aid the scientific process in cellular biology. In addition to considering a number of conceptual, methodological and technical issues that arise two ongoing case studies, being undertaken with cancer biologists, based on (1) an investigation into the regulation of the EWS-FL1 translocation fusion protein associated with Ewings Sarcoma, and (2) the interplay of RAF isoforms in the MAPK pathway are presented. 12:20 Peter Mueller (UT Austin, USA) Title: A Bayesian Feature Allocation Model for Tumor Heterogeneity Abstract: We characterize tumor variability by hypothetical latent cell types that are defined by the presence of some subset of recorded SNV’s. (single nucleotide variants, that is, point mutations). Assuming that each sample is composed of some samplespecific proportions of these cell types we can then fit the observed proportions of SNV’s for each sample. In other words, by fitting the observed proportions of SNV’s in each sample we impute latent underlying cell types, essentially by a deconvolution of the observed proportions as a weighted average of binary indicators that define cell types by the presence or absence of different SNV’s. Taking a Bayesian perspective, we proceed with a prior probability model for all relevant unknown quantities, including in particular a prior probability model on the binary indicators that characterize the latent cell types by selecting (or not) the recorded SNV’s. Such prior models are known as feature allocation models. We define a simplified version of the Indian buffet process, one of the most traditional feature allocation models. 12:50 Floor discussion Lunch break (13:00 - 14:00) Session 3 (14:00 - 15:50). Room L5, Science Concourse 14:00 Keynote. Aki Vehtari (Aalto University, Finland) Title: On Bayesian predictive methods for high-dimensional covariate selection Abstract: In this talk, I discuss the bad behavior of commonly used Bayesian model selection methods, such as the deviance information criterion (DIC), the widely applicable information criterion (WAIC), Bayesian cross-validation, relative posterior probabilities, and Bayes factors, in high-dimensional covariate selection. Even if the model selection is based on an approach giving unbiased expected predictive performance estimates for any particular model, the data-fitted model selection procedure causes the expected predictive performance estimate of the selected model to be biased. If the number of candidate models is very large, such a model selection procedure can strongly overfit to the data. The same effect can be observed when relative posterior probabilities or Bayes factors are used for the model selection, which is not surprising since the marginal likelihood can be interpreted as a sequential predictive criterion. I also present an alternative decision theoretical solution which can avoid the model selection induced bias and overfitting. I review model selection methods which Vehtari & Ojanen (2012) call reference predictive methods, and discuss computational challenges in their use. 14:40 Donatello Telesca (UCLA, USA) Title: Mixture representations of Non Local priors and graphical model determination and estimation Abstract: We review the general representation of non local priors as mixtures of truncated distributions. This construction is applied to the definition of non local priors for selection parameters of directed and undirected Gaussian graphical models. Several constructive definitions are explored and related posterior simulation strategies compared. We focus our investigations on parameter estimation in a comparative study. 15:10 Guido Consonni (Universita Cattolica di Milano, Italy) Title: Objective Bayesian Search of Gaussian Directed Acyclic Graphical Models for Ordered Variables with Non-Local Priors Abstract: Directed Acyclic Graphical (DAG) models are increasingly employed in the study of physical and biological systems to model direct influences between variables. Identifying the graph from data is a challenging endeavour, which can be more reasonably tackled if the variables are assumed to satisfy a given ordering; in this case we simply have to estimate the presence or absence of each potential edge.Working under this assumption, we propose an objective Bayesian method for searching the space of Gaussian DAG models, which provides a rich output from minimal input. We base our analysis on nonlocal parameter priors, which are especially suited for learning sparse graphs, because they allow a faster learning rate, relative to ordinary local parameter priors, when the true unknown sampling distribution belongs to a simple model. We implement an efficient stochastic search algorithm, which deals effectively with data sets having sample size smaller than the number of variables, and apply our method to a variety of simulated and real data sets. Our approach compares favourably, in terms of the ROC curve for edge hit rate versus false alarm rate, to current state-of-the-art frequentist methods relying on the assumption of ordered variables; under this assumption it exhibits a competitive advantage over the PC-algorithm, which can be considered as a frequentist benchmark for unordered variables. Importantly, we find that our method is still at an advantage, for learning the skeleton of the DAG, when the ordering of the variables is only moderately mis-specified. Prospectively, our method could be coupled with a strategy to learn the order of the variables, thus dropping the known ordering assumption. 15:40 Floor discussion Break (15:50 - 16:10) Session 4 (16:05 - 17:15). Room L5, Science Concourse 16:05 Jianhua Hu (MD Anderson Cancer Center, USA) Title: High dimensional variable selection for correlated data Abstract: Correlated data are usually encountered in high throughput genomic and proteomic data. We propose a Bayesian model selection procedure to identify important variable clusters (e.g., gene sets) that are predictive of a disease phenotype. The clustering algorithm appears to have close connections with several existing clustering algorithms including K-means. We adopt the product moment prior on the variable parameter space to control for false positive rate in the high dimensional problem. The proposed procedure is shown to have appealing empirical performance include the case of p¿n. 16:35 Luca La Rocca (Universita di Modena e Reggio Emilia, Italy) Title: Cutting-edge issues in objective Bayesian model comparison Abstract: When a Bayes factor is used to compare two nested models, the parameter priors under the two models determine its performance as an implementation of Ockhams razor (the principle that an explanation should not be more complicated than necessary). On the one hand, asymptotically, the rate at which evidence accumulates in favour of the smaller model radically depends on the local or non-local character of the prior under the larger model. On the other hand, for small samples, the amount of evidence in favour of the smaller model will be critically inflated by diffuseness of the prior under the larger model. Moreover, parameter priors can make a difference in terms of finite-sample ability to drop the larger model when both models perfectly fit the data. In my talk, based on some recent work dealing with these issues, I shall discuss, from an objective standpoint, how sharp an edge should Bayesian barbers put to their tools, assuming they care about not cutting the throat of the larger model. 17:05 Floor discussion