A massively parallel solution to a massively parallel problem
Brian Lam, PhD
HPC4NGS Workshop, May 21-22, 2012
Principe Felipe Research Center, Valencia

Next-generation sequencing and current challenges

What is Many-core/GPGPU computing?

The use of GPGPU in NGS bioinformatics

How it works – the BarraCUDA project

System requirements for GPU computing

What if I want to develop my own GPU code?

What to look out for in the near future?

Conclusions

Next-generation sequencing and current challenges

“DNA sequencing includes several methods
and technologies that are used for determining
the order of the nucleotide bases A, G, C, T in a
molecule of DNA.” - Wikipedia

Commercialised in 2005 by 454 Life Sciences

Sequencing in a massively parallel fashion

An example: whole-genome sequencing
 Extract genomic DNA from blood/mouth swabs
 Break into small DNA fragments of 200-400 bp
 Attach DNA fragments to a surface (flow
cells/slides/microtitre plates) at a high density
 Perform concurrent “cyclic sequencing reaction” to obtain the
sequence of each of the attached DNA fragments
An Illumina HiSeq 2000 can interrogate 825K spots/mm²
Billions of short DNA sequences, also called sequence reads, ranging from 25 to 400 bp
[Figure: the NGS data-analysis pipeline – image analysis → base calling → sequence alignment → variant calling / peak calling. Source: Genologics]

What is Many-core/GPGPU computing?

A physical die that contains a ‘large number’ of processing cores

i.e. computation can be done in a massively parallel
manner

Modern graphics cards (GPUs) consist of hundreds
to thousands of computing cores
[Figure: CPUs vs GPUs – how the transistors are used. Source: NVIDIA]

GPUs are fast, but there is a catch:
 SIMD – Single instruction, multiple data
VS

CPUs are powerful multi-purpose processors
 MIMD – Multiple instruction, multiple data

The very same (low-level) instruction is applied to multiple data elements at the same time
 e.g. a GTX 680 can apply an addition to 1536 data points at a time, versus 16 on a 16-core CPU

Branching results in serialisation (see the sketch below)

The ALUs on a GPU are usually much more primitive than their CPU counterparts
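To make the SIMD model concrete, here is a minimal CUDA sketch (the kernel and variable names are invented for illustration) in which every thread executes the same instruction stream on a different element:

__global__ void add_arrays(const float *a, const float *b, float *c, int n)
{
    // every thread runs these same instructions on a different element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // one addition per thread, thousands of elements in flight
    // a data-dependent if/else in here would force the diverging threads of a
    // warp to execute one branch after the other - the serialisation noted above
}

// host side: launch enough 256-thread blocks to cover all n elements, e.g.
// add_arrays<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);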

Scientific computing often deals with large amounts of data and, on many occasions, applies the same set of instructions to all of them (a toy sketch follows the examples below).
 Examples:
▪ Monte Carlo simulations
▪ Image analysis
▪ Next-generation sequencing data analysis
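As a toy illustration of this pattern (all names and parameters here are invented for the example), the CUDA sketch below estimates π by Monte Carlo: thousands of threads apply exactly the same few instructions to different random samples.

#include <curand_kernel.h>

__global__ void mc_pi(unsigned long long seed, int samples_per_thread, unsigned int *hits)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, id, 0, &state);            // an independent random stream per thread

    unsigned int local_hits = 0;
    for (int s = 0; s < samples_per_thread; ++s) {
        float x = curand_uniform(&state);        // the same instructions,
        float y = curand_uniform(&state);        // different data, on every thread
        if (x * x + y * y <= 1.0f) ++local_hits; // count points inside the unit circle
    }
    atomicAdd(hits, local_hits);                 // combine the per-thread counts
}

// pi is then estimated as 4 * hits / (number of threads * samples_per_thread)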

Low capital cost and energy efficient
 Dell 12-core workstation £5,000, ~1kW
 Dell 40-core computing cluster £20,000+, ~6kW
 NVIDIA Geforce GTX680 (1536 cores): £400, <0.2kW
 NVIDIA C2070 (448 cores): £1000, 0.2kW

Many supercomputers also contain multiple GPU nodes
for parallel computations

Examples of GPU-accelerated bioinformatics tools and their reported speed-ups:
 CUDASW++ – 6.3×
 MUMmerGPU – 3.5×
 GPU-HMMer – 60-100×

The use of GPGPU in NGS bioinformatics

The Ion Torrent server (a T7500 workstation) uses GPUs for base calling

MUMmerGPU – comparisons among genomes

BarraCUDA, SOAP3 – short-read alignment

How it works – the BarraCUDA project

[Figure: the NGS data-analysis pipeline – image analysis → base calling → sequence alignment → variant calling / peak calling]

Sequence alignment is a crucial step in the bioinformatics pipeline for downstream analyses. It is CPU-intensive and often takes many hours to perform, and is usually done on HPC clusters.

The main objective of the BarraCUDA project is to develop software that runs on GPU/many-core architectures
 i.e. to map sequence reads the same way as they come out of the NGS instrument
[Workflow: the CPU loads the read library and the reference genome from disk and copies both to the GPU; the GPU runs many alignments concurrently; the alignment results are then copied back to the CPU and written to disk]

Originally intended for data compression, the BWT performs a reversible transformation of a string

In 2000, Ferragina and Manzini introduced a BWT-based index data structure for fast substring matching in O(n)

Substring matching is performed in a tree-traversal-like manner (a worked toy example follows the figure below)

Used in major sequencing-read mapping programs, e.g. BWA, Bowtie, SOAP2
[Figure: prefix trie of ‘banana’, showing successful matching of the substring ‘banan’ and failure of ‘anb’. Modified from Li & Durbin, Bioinformatics 2009, 25:14, 1754-1760]
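To make the C(.) and O(.,.) bookkeeping concrete before the pseudocode below, here is a self-contained toy reconstruction (plain C, so it also compiles with nvcc; it is not code from BWA or BarraCUDA, and the exact ±1 conventions differ slightly between formulations). It builds the BWT of ‘banana’ and then counts occurrences of a substring by backward search using only C(.) and O(.,.):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;   /* "banana$" - the sentinel '$' sorts before the letters */

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* O(b, i): occurrences of character b in BWT[0..i]; O(b, -1) = 0 */
static int occ(const char *bwt, char b, int i)
{
    int c = 0;
    for (int j = 0; j <= i; ++j) c += (bwt[j] == b);
    return c;
}

int main(void)
{
    text = "banana$";
    int n = (int)strlen(text);                  /* 7, including the sentinel */

    /* naive suffix array: sort the suffix start positions lexicographically */
    int sa[8];
    for (int i = 0; i < n; ++i) sa[i] = i;
    qsort(sa, n, sizeof(int), cmp_suffix);

    /* BWT[i] is the character that precedes suffix sa[i] (wrapping around) */
    char bwt[8];
    for (int i = 0; i < n; ++i) bwt[i] = sa[i] ? text[sa[i] - 1] : text[n - 1];
    bwt[n] = '\0';
    printf("BWT(banana$) = %s\n", bwt);         /* prints annb$aa */

    /* C(b): number of characters in the text strictly smaller than b */
    int cnt[256] = {0}, C[256] = {0};
    for (int i = 0; i < n; ++i) cnt[(unsigned char)text[i]]++;
    for (int b = 1; b < 256; ++b) C[b] = C[b - 1] + cnt[b - 1];

    /* backward search: walk the pattern from its last character to its first,
       narrowing the suffix-array interval [k, l] at every step */
    const char *pat = "ana";
    int k = 0, l = n - 1;
    for (int i = (int)strlen(pat) - 1; i >= 0 && k <= l; --i) {
        char b = pat[i];
        k = C[(unsigned char)b] + occ(bwt, b, k - 1);
        l = C[(unsigned char)b] + occ(bwt, b, l) - 1;
    }
    printf("'%s' occurs %d time(s)\n", pat, k <= l ? l - k + 1 : 0);   /* 2 */
    return 0;
}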
BWT_exactmatch(READ, i, k, l) {
    if (i < 0) then return RESULTS;
    k = C(READ[i]) + O(READ[i], k-1) + 1;
    l = C(READ[i]) + O(READ[i], l);
    if (k <= l) then BWT_exactmatch(READ, i-1, k, l);
}

main() {
    Calculate reverse BWT string B from reference string X
    Calculate arrays C(.) and O(.,.) from B
    Load READS from disk
    For every READ in READS do {
        i = |READ|;   // position in the read
        k = 0;        // lower bound of the SA interval
        l = |X|;      // upper bound of the SA interval
        BWT_exactmatch(READ, i, k, l);
    }
    Write RESULTS to disk
}

Modified from Li & Durbin, Bioinformatics 2009, 25:14, 1754-1760

Simple data parallelism

The GPU is used mainly for the matching step
__device__ BWT_GPU_exactmatch(W, i, k, l) {
    if (i < 0) then return RESULTS;
    k = C(W[i]) + O(W[i], k-1) + 1;
    l = C(W[i]) + O(W[i], l);
    if (k <= l) then BWT_GPU_exactmatch(W, i-1, k, l);
}

__global__ GPU_kernel() {
    W = READS[thread_no];
    i = |W|;      // position in the read
    k = 0;        // lower bound of the SA interval
    l = |X|;      // upper bound of the SA interval
    BWT_GPU_exactmatch(W, i, k, l);
}

main() {
    Calculate reverse BWT string B from reference string X
    Calculate arrays C(.) and O(.,.) from B
    Load READS from disk
    Copy B, C(.) and O(.,.) to GPU
    Copy READS to GPU
    Launch GPU_kernel with <<|READS|>> concurrent threads
    Copy alignment RESULTS back from GPU
    Write RESULTS to disk
}
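In real CUDA, the “copy and launch” steps of the pseudocode’s main() look roughly like the sketch below. This is an illustration only (names, sizes and the placeholder result are invented, and error checking is omitted); it is not BarraCUDA’s actual host code. It shows where thread_no comes from: each thread derives its own read index from the block and thread indices.

#include <cuda_runtime.h>

__global__ void gpu_kernel(const char *d_reads, int read_len, int n_reads, int *d_results)
{
    // one thread per read: derive this thread's read index from the launch grid
    int thread_no = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_no >= n_reads) return;               // guard the last, partly filled block
    const char *read = d_reads + (size_t)thread_no * read_len;
    // ... align `read` against the BWT as in the pseudocode above ...
    d_results[thread_no] = read[0];                 // placeholder result for this sketch
}

void launch_alignment(const char *h_reads, int read_len, int n_reads, int *h_results)
{
    char *d_reads; int *d_results;
    cudaMalloc((void **)&d_reads, (size_t)n_reads * read_len);   // space for READS on the GPU
    cudaMalloc((void **)&d_results, n_reads * sizeof(int));
    cudaMemcpy(d_reads, h_reads, (size_t)n_reads * read_len,
               cudaMemcpyHostToDevice);                          // "Copy READS to GPU"

    int threads = 256;                                           // threads per block
    int blocks = (n_reads + threads - 1) / threads;              // enough blocks to cover every read
    gpu_kernel<<<blocks, threads>>>(d_reads, read_len, n_reads, d_results);

    cudaMemcpy(h_results, d_results, n_reads * sizeof(int),
               cudaMemcpyDeviceToHost);                          // "Copy RESULTS back from GPU"
    cudaFree(d_reads); cudaFree(d_results);
}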

Very fast indeed: using a Tesla C2050, we can match 25 million 100 bp reads to the BWT in just over a minute.

But… is this relevant?
[Figure: matching the substring ‘banan’ where ‘b’ is substituted with an ‘a’ – with differences allowed, many more paths of the prefix trie have to be explored, and the search space complexity grows to O(9^n). Klus et al., BMC Res Notes 2012 5:27]

partially
It _________worked!
 10% faster than 8x X5472 cores @ 3GHz
 BWA uses a greedy breadth-first search approach (takes up to 40MB
per thread)
 Not enough workspace for thousands of concurrent kernel threads (@
4KB) – i.e. reduced accuracy – NOT GOOD ENOUGH!
hit
hit

Recall the catch with SIMD: the very same instruction is applied to multiple data elements at the same time, and any branching results in serialisation.
The search space increases exponentially with longer read lengths. Even though the programs impose general search-space restrictions, such as a maximum number of differences and seeding, the speed reduction cannot be completely eliminated. There is always a balance between speed and accuracy: the more aggressive the search-space reduction, the lower the accuracy of the alignment becomes.
[Figure 2.2: Illustration of how thread-execution divergence is handled. Threads whose execution paths differ pose no problem on a multi-core CPU, where each core follows its own path; on the GPU the diverging branches are executed one after the other, with the remaining threads idling on ‘data-empty’ instructions.]
[Flowchart of the alignment kernel: if the query sequence has more than three N’s, exit. Otherwise calculate the boundaries of differences for the query (widths and bids), start from the first base with the full BWT suffix array as the search boundary, and repeat: look for all possible hits within the search boundary (perfect match, mismatches, indels) and check whether there is any hit (k ≤ l). If there is no hit, move one base backwards and look for the next best hit, unless the first base of the query has already been reached. If there is a hit and the last base of the query has been reached, record the alignment (k, l and the differences); otherwise choose and store the best hit, set the new search boundary from its k and l, and proceed to the next base.]
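The loop above has to fit into a few kilobytes of workspace per thread, so the recursion of the earlier pseudocode gives way to an explicit, fixed-size stack held in each thread’s own memory. The sketch below is a greatly simplified illustration of that idea and nothing more: it handles mismatches only (no indels, no seeding, no widths/bids), the names, the stack depth and the layout of the O(.,.) table are assumptions, and it is not BarraCUDA’s actual kernel code.

#define STACK_MAX 128   // fixed per-thread workspace (128 x 16 bytes = 2 KB); an assumption

struct Entry { unsigned int k, l; int i; int diff; };

// Count the reference positions matching one read with at most max_diff mismatches.
// Assumptions: read holds 2-bit-encoded bases (0..3); n = |X|; the O(.,.) table is laid
// out so that O[b*(n+2) + j] equals O(b, j-1) in the slide's notation, with O[b*(n+2)] = 0.
__device__ int bwt_inexact_hits(const unsigned char *read, int len,
                                const unsigned int *C, const unsigned int *O,
                                unsigned int n, int max_diff)
{
    Entry stack[STACK_MAX];                 // explicit stack instead of recursion
    int top = 0, hits = 0;

    Entry s; s.k = 0; s.l = n; s.i = len - 1; s.diff = 0;
    stack[top++] = s;                       // start from the full suffix-array interval

    while (top > 0) {
        Entry e = stack[--top];             // depth-first: always take the newest entry
        if (e.i < 0) {                      // whole read consumed: every SA entry in
            hits += e.l - e.k + 1;          // [k, l] is a hit within max_diff mismatches
            continue;
        }
        for (unsigned int b = 0; b < 4; ++b) {                 // try A, C, G, T in turn
            int d = e.diff + (b == read[e.i] ? 0 : 1);         // a mismatch costs 1
            if (d > max_diff) continue;
            unsigned int k = C[b] + O[b * (n + 2) + e.k] + 1;  // = C(b) + O(b, k-1) + 1
            unsigned int l = C[b] + O[b * (n + 2) + e.l + 1];  // = C(b) + O(b, l)
            if (k > l) continue;                               // empty interval: dead end
            if (top < STACK_MAX) {                             // a real kernel must bound
                Entry t; t.k = k; t.l = l; t.i = e.i - 1; t.diff = d;
                stack[top++] = t;                              // this growth very carefully
            }
        }
    }
    return hits;
}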
Mapping performance (Klus et al., BMC Res Notes 2012 5:27):

           BWA 0.5.8   BarraCUDA
Mapping    89.95%      91.42%
Error      0.06%       0.05%
[Bar chart: time taken (min) for read mapping on Xeon 5160 and X5670 CPUs using 1, 2, 4 and 6 cores versus a single NVIDIA M2090 GPU. Klus et al., BMC Res Notes 2012 5:27]

System requirements for GPU computing

Hardware

The system must have at least one decent GPU
▪ An NVIDIA GeForce 210 will not work!

One or more PCIe x16 slots

A decent power supply with appropriate power connectors
▪ 550 W for one Tesla C2075, plus 225 W for each additional card
▪ Don’t use any Molex converters!

 Ideally, use dedicated cards for computation and a separate, cheap card for display

Software
 CUDA
▪ CUDA toolkit
▪ Appropriate NVIDIA drivers
 OpenCL
▪ CUDA toolkit (NVIDIA)
▪ AMD APP SDK (AMD)
▪ Appropriate AMD/NVIDIA drivers
For example: CUDA Toolkit v4.2

DEMO

What if I want to develop my own GPU code?

In the end, how much effort have we put in so far?
 3300 lines of new code for the alignment core
▪ Compared to 1000 lines in BWA
▪ Still ongoing!
 1000 lines for suffix array (SA) to linear-space conversion

CUDA
 (in general) is faster and more complete
 Tied to NVIDIA hardware

OpenCL
 Still new, AMD and Apple are pushing hard on this
 Can run on a variety of hardware including CPUs

NVIDIA developer zone
 http://developer.nvidia.com/

OpenCL developer zone
 http://developer.amd.com/zones/OpenCLZone/Pages/default.aspx

Different hardware has different capabilities
 The GT200 is very different from the GF100 in terms of cache arrangement and concurrent kernel execution
 The GF100 is also different from the GK110, which can do dynamic parallelism

Low-level data management is often required
 e.g. earmarking data for a particular memory type and coalescing memory accesses (see the sketch below)
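For example (a minimal sketch with invented names, launched with 256 threads per block), the kernel below stages data in on-chip __shared__ memory, which the programmer has to earmark explicitly, and has consecutive threads read consecutive global addresses so that the loads coalesce into a few wide memory transactions:

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // explicitly earmarked on-chip memory
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.f;  // coalesced: neighbouring threads read
    __syncthreads();                                // neighbouring addresses

    // tree reduction within the block, entirely in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                  // one partial sum per block
}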

The easy way out!
 A set of compiler directives
 It allows programmers to use ‘accelerators’ without explicitly writing GPU code
 Supported by CAPS, Cray, NVIDIA, PGI
Source: NVIDIA
http://www.openacc-standard.org/

What to look out for in the near future?

The game is changing quickly
 CPUs are also getting more cores : AMD 16-core Opteron 6200 series

Many-core platforms are evolving
 Intel MIC processors (50 x86 cores)
 NVIDIA Kepler platform (1536 cores)
 AMD Radeon 7900 series (2048 cores)

SIMD nodes are becoming more common in supercomputers

OpenCL

Conclusions

Few software packages are available yet for GPU-accelerated NGS bioinformatics; more are to come

BarraCUDA is one of the first attempts to accelerate the NGS bioinformatics pipeline

A significant amount of coding is usually required, but more programming tools are becoming available

Many-core computing is still evolving rapidly and things can be very different in the next 2-3 years
IMS-MRL, Cambridge
Giles Yeo
Petr Klus
Simon Lam
Gurdon Institute, Cambridge
Nicole Cheung
NIHR-CBRC, Cambridge
Ian McFarlane
Cassie Ragnauth
HPCS, Cambridge
Stuart Rankin
Whittle Lab, Cambridge
Graham Pullan
Tobias Brandvik
NVIDIA Corporation
Thomas Bradley
Timothy Lanfear
Microbiology, University College Cork
Dag Lyberg