
Project title: GPU Programming for Nonparametric Bayesian Data Integration of Large Datasets

Background: General-purpose computing on graphics processing units (GP-GPU) provides a promising means by which to address a variety of compute-intensive tasks. At the same time, the generation of massive, complex datasets is becoming the norm for scientific researchers in disciplines as diverse as astrophysics, biology, econometrics, and weather prediction. The computational costs associated with modern statistical approaches (such as Markov chain Monte Carlo) for performing inference and extracting information from these datasets can be prohibitively expensive unless parallel implementations are developed. As discussed in Lee et al. (2010), GP-GPU implementations of these approaches can offer numerous advantages over those that rely on distributed multicore clusters of CPUs (not least that GPUs are relatively cheap, and can be hosted in conventional desktop and laptop computers).

In the modern biological sciences, the integration of multiple datasets (typically from various high-throughput ‘omics’ technologies) has been shown to offer deeper scientific insights than may be achieved by considering data sources independently (Cooke et al., 2009; Savage et al., 2010; Yuan et al., 2011). This is intuitively unsurprising, given the complex networks of interactions between the various players (genes, proteins, metabolites, and so on) that act within living cells. We have recently developed a method for modelling multiple datasets using Bayesian nonparametrics (in particular, Dirichlet mixture models), which we have found to be very effective for integrating moderately sized datasets. We refer to this method as MDI, simply as shorthand for Multiple Dataset Integration.

Objectives: The overall aim is to produce a GPU implementation of MDI that will be significantly faster than our existing Matlab code.
This will allow us to tackle problems involving tens of thousands of genes and enable us to integrate more datasets than has previously been possible. We already have a GPU implementation (C++ code, developed by Suchard et al., 2010) of a method that performs Dirichlet mixture modelling of a single dataset of a particular data type. We first seek to modify this so that a variety of data types may be modelled. Once this has been achieved, the aim will be to extend the codebase in order to enable multiple datasets to be integrated using our MDI approach. We have mature (non-GPU) Matlab prototypes that will allow us to test the developed code thoroughly.

What the student will do: The student will first get the existing C++ code running, and familiarise him/herself with the problem and code structure. He/she will then generalise the code in order to allow a variety of different data types to be modelled (using the existing Matlab prototypes as a guide). The existing Matlab code will also be modified to exploit GPU computing, making use of the native GPGPU support provided in Matlab R2010b and later within the Parallel Computing Toolbox. This work will form the basis of a GPU implementation (which the student will develop) of our MDI method. If time permits, an implementation using the NVIDIA CUDA environment will be developed.

Throughout, it will be vital to test and benchmark the implementation. We have already used our Matlab code to investigate a number of examples, so it will be important to check that the student’s implementation generates consistent results. Once we have established this, the student will apply his/her program to larger and more numerous datasets than has previously been possible, and determine the practical limits of the new implementation.

Notes: This project will be associated with a recently funded EPSRC grant on Advanced Bayesian Computation for Cross-Disciplinary Research, in collaboration with the Universities of Cambridge, Sussex and Kent. NVIDIA are directly supporting this project with a contribution of high-end GPUs and researcher time to provide technical advice. A fully funded EPSRC PhD studentship is available on this project for a suitably qualified student from September 2012.

References:

Cooke et al., 2009. Computational approaches to the integration of gene expression, ChIP-chip and sequence data in the inference of gene regulatory networks. Semin Cell Dev Biol, vol. 20 (7) pp. 863-8.

Lee et al., 2010. On the Utility of Graphics Cards to Perform Massively Parallel Simulation of Advanced Monte Carlo Methods. J Comput Graph Stat, vol. 19 (4) pp. 769-789.

Savage et al., 2010. Discovering transcriptional modules by Bayesian data integration. Bioinformatics, vol. 26 (12) pp. i158-67.

Suchard et al., 2010. Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures. J Comput Graph Stat, vol. 19 (2) pp. 419-438.

Yuan et al., 2011. Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput Biol, vol. 7 (10) pp. e1002227.