Interactive Genome Analysis 2014_10_18

Interactive Genome Analysis (IGA)
Problem Statement
Research works best when researchers can explore their data interactively. However, the current
state of the art in genome analysis forces researchers to wait hours, or even days, for their results.
There are obvious reasons for this, including the sheer size of the data sets and the still-immature
scientific analysis algorithms. Many approaches and vendor solutions address particular parts
of this challenge, such as integration platforms, specialized database engines, a constant stream of
improved algorithms, or various types of hardware accelerators. Each optimization approach is relevant,
but only for particular parts of the challenge. None of these approaches addresses the overall
performance problem. Furthermore, they are often difficult to replicate in one's own institution, and they
tend to break as use cases and methods evolve. In practice, genome analysis is today where statistical
data analysis was 30 years ago.
Two typical approaches illustrate this challenge:


- The divide-and-conquer compute approach, common for clusters running Hadoop and platforms
  offered by Cloud vendors, works well on generic compute nodes and for tasks that are easily
  parallelized, but performance degrades for data-intensive tasks.
- Specialized hardware, on the other hand, will greatly reduce compute time, but the relevant
  use cases and data types become fragmented. Custom accelerator chips (ASICs), for example, will
  increase compute performance by orders of magnitude compared to general-purpose
  processors. Yet, as the compute algorithm is burned into silicon, a single change in the algorithm
  requires a new chip, making ASICs difficult to use in the fast-evolving field of genomic research.
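To make the first approach concrete, the following is a minimal sketch of divide-and-conquer on a single machine, not a description of Hadoop or of any vendor platform. The task (counting bases in a sequence), the function names, and the default chunk size are all illustrative assumptions. A thread pool stands in for the compute nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequence, chunk_size):
    """Divide: cut the sequence into independent, non-overlapping chunks."""
    return [sequence[i:i + chunk_size]
            for i in range(0, len(sequence), chunk_size)]


def count_bases(chunk):
    """Conquer: count the bases in one chunk without looking at any other."""
    return Counter(chunk)


def base_counts(sequence, chunk_size=4, workers=2):
    """Hypothetical helper: per-chunk counts computed by a pool, then merged."""
    chunks = split_into_chunks(sequence, chunk_size)
    # The thread pool is only a stand-in for cluster nodes (an assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_bases, chunks))
    # Combine: merging the partial results is cheap for this task.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def gc_fraction(sequence, chunk_size=4, workers=2):
    """Fraction of G and C bases, derived from the merged counts."""
    counts = base_counts(sequence, chunk_size, workers)
    return (counts["G"] + counts["C"]) / len(sequence)
```

Base counting parallelizes easily because each chunk can be processed in isolation and the merge step is trivial. In a real cluster every chunk would also have to be moved to the node that processes it, which is one plausible reason why this approach struggles with data-intensive tasks.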
Our hypothesis is that the ingredients for Interactive Genome Analysis already exist. Some
Pistoia Alliance members have, individually and to a greater or lesser extent, addressed some of these
issues. Yet, given the volatile nature and challenging business case of R&D, commercial vendors struggle
to provide sharable blueprints. Our ideal solution would identify and pool current approaches as well as
the data and use cases with which they work best, and propose a systems approach to integrate the best
approaches tightly in an optimized compute environment. At a minimum, this effort will yield a
state-of-the-art catalogue of genomic analysis performance optimizations. At best, we will develop an
architectural blueprint that could be shared at partner institutions. Given the extreme compute
requirements we are already struggling to provide for today's research, only an integrated approach will
yield the performance necessary for the Interactive Genome Analysis of tomorrow's research.
Proposal
1. Establish a Pistoia Alliance work group with an experienced, user-oriented subject matter lead.
2. Identify the use cases and functionality that scientists require for their genomic research.
3. Gather the state-of-the-art in performance optimizations (hardware, software, architectures)
from Pistoia Alliance members and across the industry.
4. Develop and publish the architectural blueprint and specification that will yield Interactive
Genome Analysis (architecture, use-cases, components, benchmarks, constraints, costs).
5. Establish a real-world implementation of an IGA.