Interactive Genome Analysis 2014_10_18

Interactive Genome Analysis (IGA)

Problem Statement

Research works best when researchers can explore their data interactively. However, the current state of the art in genome analysis forces researchers to wait hours, or even days, for their results. There are obvious reasons for this, including the sheer size of the data sets and the still-immature scientific analysis algorithms.

Many approaches and vendor solutions address particular parts of this challenge, such as integration platforms, specialized database engines, a constant stream of improved algorithms, or various types of hardware accelerators. Each optimization approach is relevant, but only for particular parts of the challenge. None of these approaches addresses the overall performance problem. Furthermore, they are often difficult to replicate in one's own institution, and they tend to break as use cases and methods evolve. In practice, genome analysis today is where statistical data analysis was 30 years ago.

Two typical approaches illustrate this challenge. The divide-and-conquer compute approach, common for clusters running Hadoop and for platforms offered by cloud vendors, works well on generic compute nodes and for tasks that are easily parallelized, but its performance degrades for data-intensive tasks. Specialized hardware, on the other hand, greatly reduces compute time, but the relevant use cases and data types become fragmented. Custom accelerator chips (ASICs), for example, increase compute performance by orders of magnitude compared to general-purpose processors. Yet, because the compute algorithm is burned into silicon, a single change in the algorithm requires a new chip, which makes ASICs difficult to use in the fast-evolving field of genomic research.

Our hypothesis is that the ingredients for Interactive Genome Analysis already exist.
Some Pistoia Alliance members have, to a greater or lesser extent and individually, addressed some of these issues. Yet, given the volatile nature and challenging business case of R&D, commercial vendors struggle to provide sharable blueprints.

Our ideal solution would identify and pool current approaches, together with the data and use cases with which they work best, and propose a systems approach that integrates the best approaches tightly in an optimized compute environment. At a minimum, this effort will yield a state-of-the-art catalogue of genomic analysis performance optimizations. At best, we will develop an architectural blueprint that could be shared with partner institutions. Given the extreme compute requirements that we already struggle to meet for today's research, only an integrated approach will yield the performance necessary for the Interactive Genome Analysis of tomorrow's research.

Proposal

1. Establish a Pistoia Alliance work group with an experienced, user-oriented subject matter lead.
2. Identify the use cases and functionality that scientists require for their genomic research.
3. Gather the state of the art in performance optimizations (hardware, software, architectures) from Pistoia Alliance members and across the industry.
4. Develop and publish the architectural blueprint and specification that will yield Interactive Genome Analysis (architecture, use cases, components, benchmarks, constraints, costs).
5. Establish a real-world implementation of an IGA.
