Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab)
University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany
1
Networked Systems Laboratory (NetSysLab)
University of British Columbia
A golf course …
… a (nudist) beach
(… and 199 days of rain each year)
2
Hybrid architectures in Top 500
[Nov’10]
3
• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  – [operated today at low efficiency]
• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures
4
[Slides 5-11: GPU architecture intuition; slides borrowed from a presentation by Kayvon Fatahalian]
Idea #3: Feed the cores with data
• The processing elements are data hungry!
  → Wide, high-throughput memory bus
12
Idea #4: Hide memory access latency
• 10,000x parallelism!
  → Hardware-supported multithreading
13
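To make ideas #3 and #4 concrete, here is a minimal CUDA sketch (illustrative, not from the original slides): a memory-bound kernel launched with tens of thousands of blocks, so the hardware scheduler can hide global-memory latency by switching among warps while adjacent threads issue wide, coalesced memory transactions.

#include <cuda_runtime.h>

// Memory-bound kernel: each thread reads two floats and writes one.
// Adjacent threads touch adjacent addresses, so accesses coalesce into
// wide memory-bus transactions (idea #3). Launching far more threads than
// cores lets the scheduler run ready warps while others wait hundreds of
// cycles for global memory (idea #4).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 24;                       // ~16M elements
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;    // 65,536 blocks of 256 threads
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}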
The Resulting GPU Architecture
NVIDIA Tesla C2050
• 448 cores
• Four 'memories'
  – Shared: fast (~4 cycles), small (48KB per multiprocessor)
  – Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
  – Texture: read only
  – Constant: read only
• Hybrid: host and GPU connected over PCIe x16 (~4GB/s)
[Figure: host machine with host memory attached over PCIe to the GPU; the GPU contains N multiprocessors, each with cores, registers, shared memory, and an instruction unit; constant, texture, and global memory are shared across multiprocessors]
14
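A minimal CUDA sketch (illustrative, not part of the original deck) of how code maps onto this memory hierarchy: each block stages a tile of slow global memory into the fast 48KB shared memory and reuses it on-chip, so most accesses cost a few cycles instead of 400-600. The block size of 256 threads is an assumption chosen to match the tile.

#include <cuda_runtime.h>

// Block-level sum: stage 256 floats from global memory into shared memory,
// then reduce entirely on-chip.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float tile[256];                   // lives in the 48KB shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // one coalesced global read per thread
    __syncthreads();
    // Tree reduction in shared memory: no further global-memory traffic.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];          // one global write per block
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *dIn, *dSums;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dSums, blocks * sizeof(float));
    blockSum<<<blocks, threads>>>(dIn, dSums, n);
    cudaDeviceSynchronize();
    cudaFree(dIn); cudaFree(dSums);
    return 0;
}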
GPUs offer different characteristics
• High peak compute power
• High peak memory bandwidth
• High host-device communication overhead
• Limited memory space
• Complex to program
15
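To put a number on the host-device communication overhead, a small timing sketch (an assumption-laden illustration, not a measurement from the talk) that uses CUDA events to report effective PCIe throughput for a pinned host buffer:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u << 20;             // 256MB test buffer
    float *host, *dev;
    cudaMallocHost(&host, bytes);                // pinned memory for full PCIe speed
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->device: %.2f GB/s over PCIe\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev); cudaFreeHost(host);
    return 0;
}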
Projects at NetSysLab@UBC
http://netsyslab.ece.ubc.ca
Porting applications to efficiently exploit GPU characteristics
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
Middleware runtime support to simplify application development
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
GPU-optimized building blocks: data structures and libraries
• GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08
16
Motivating question:
How should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
• A string-matching problem
• Data intensive (~10^2 GB)
17
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]:
• A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• ~4x speedup (end-to-end) compared to the CPU version
[Figure: runtime breakdown (%) showing that more than 50% of the end-to-end time is overhead]
Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
18
Idea: trade off time for space
• Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
• 4x speedup compared to the suffix-tree-based GPU version
Consequences:
• Significant overhead reduction
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus is shifted towards optimizing the compute stage
19
Outline for the rest of this talk
• Sequence alignment: background and offloading to the GPU
• Space/time trade-off analysis
• Evaluation
20
Background: the Sequence Alignment Problem
[Figure: a set of short queries (e.g., ...TAGGC, ...GGCTA, ...TAGG) aligned against positions in a long reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]
Problem: find where each query most likely originated from
• Queries
  – ~10^8 queries
  – 10^1 to 10^2 symbols per query
• Reference
  – 10^6 to 10^11 symbols (up to ~400GB)
21
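As a toy illustration of the matching problem only (not how MUMmerGPU or the authors' system works; real tools pre-process the reference into an index, as the next slides discuss), a brute-force CUDA sketch where each thread scans the reference for one query. The query strings and sizes are made up for the example.

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Naive exact matcher: thread t scans the whole reference for query t.
// O(ref_len * qry_len) work per query, far too slow at 10^8 queries, which
// is why the real pipeline builds an index over the reference instead.
__global__ void naiveMatch(const char *ref, int refLen,
                           const char *qrys, int qryLen, int nQrys, int *pos) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nQrys) return;
    const char *q = qrys + t * qryLen;
    pos[t] = -1;
    for (int i = 0; i + qryLen <= refLen; ++i) {
        int j = 0;
        while (j < qryLen && ref[i + j] == q[j]) ++j;
        if (j == qryLen) { pos[t] = i; break; }
    }
}

int main() {
    const char ref[]  = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG";  // fragment from the slide
    const char qrys[] = "TAGGC" "TTTGC";                           // two 5-symbol queries
    char *dRef, *dQrys; int *dPos, pos[2];
    cudaMalloc(&dRef, sizeof(ref)); cudaMalloc(&dQrys, sizeof(qrys)); cudaMalloc(&dPos, 2 * sizeof(int));
    cudaMemcpy(dRef, ref, sizeof(ref), cudaMemcpyHostToDevice);
    cudaMemcpy(dQrys, qrys, sizeof(qrys), cudaMemcpyHostToDevice);
    naiveMatch<<<1, 2>>>(dRef, (int)strlen(ref), dQrys, 5, 2, dPos);
    cudaMemcpy(pos, dPos, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("TAGGC at %d, TTTGC at %d\n", pos[0], pos[1]);          // expected: 3 and 28
    cudaFree(dRef); cudaFree(dQrys); cudaFree(dPos);
    return 0;
}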
GPU Offloading: Opportunity and Challenges
Opportunity
• Sequence alignment: easy to partition, memory intensive
• GPU: massively parallel, high memory bandwidth
Challenges
• Sequence alignment: data intensive, large output size
• GPU: limited memory space, no direct access to other I/O devices (e.g., disk)
22
GPU Offloading: Addressing the Challenges
• Data-intensive problem and limited memory space
  → divide and compute in rounds
  → search-optimized data structures
• Large output size
  → compressed output representation (decompressed on the CPU)
High-level algorithm (executed on the host):
subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}
23
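A hedged CUDA host-side sketch of this round structure; MatchKernel, Decompress, and the chunk types are placeholders rather than the real MUMmerGPU-based code. Each (sub-query set, sub-reference) pair is sized to fit in device memory, and the compact results are expanded on the CPU after each round.

#include <vector>
#include <cuda_runtime.h>

struct Chunk { std::vector<char> data; };                   // one sub-reference or sub-query batch

// Placeholder kernel: in the real system each thread matches one query
// against the sub-reference using the search index.
__global__ void MatchKernel(const char *qrys, size_t qryBytes,
                            const char *ref, size_t refBytes, int *results) { }

static void Decompress(std::vector<int> &results) { /* expand compact matches on the CPU */ }

void AlignInRounds(const std::vector<Chunk> &subrefs,
                   const std::vector<Chunk> &subqrysets, size_t maxResults) {
    int *dResults;
    cudaMalloc(&dResults, maxResults * sizeof(int));
    for (const Chunk &qs : subqrysets) {                    // one query batch per outer round
        char *dQrys;
        cudaMalloc(&dQrys, qs.data.size());
        cudaMemcpy(dQrys, qs.data.data(), qs.data.size(), cudaMemcpyHostToDevice);
        std::vector<int> results(maxResults);
        for (const Chunk &r : subrefs) {                    // stream sub-references through the GPU
            char *dRef;
            cudaMalloc(&dRef, r.data.size());
            cudaMemcpy(dRef, r.data.data(), r.data.size(), cudaMemcpyHostToDevice);
            MatchKernel<<<1024, 256>>>(dQrys, qs.data.size(), dRef, r.data.size(), dResults);
            cudaMemcpy(results.data(), dResults, maxResults * sizeof(int), cudaMemcpyDeviceToHost);
            cudaFree(dRef);
        }
        Decompress(results);                                // post-process compact output on the CPU
        cudaFree(dQrys);
    }
    cudaFree(dResults);
}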
Space/Time Trade-off Analysis
24
The core data structure
Massive number of queries and a long reference => pre-process the reference into an index
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))
[Figure: suffix tree for the example string TACACA$]
25
The core data structure (mapped onto the host algorithm)
Past work: build a suffix tree (MUMmerGPU [Schatz 07])
• Search: O(qry_len) per query => MatchKernel is efficient
• Space: O(ref_len) with a high constant (~20 x ref_len) => CopyToGPU(subref) is expensive
• Post-processing: O(4^(qry_len - min_match_len)) DFS traversal per query => Decompress(results) is expensive
26
A better matching data structure?
[Figure: suffix tree vs. suffix array for the example string TACACA$; the suffix array stores the suffix start positions in sorted suffix order: A$, ACA$, ACACA$, CA$, CACA$, TACACA$]
               Suffix tree                       Suffix array
Space          O(ref_len), ~20 x ref_len         O(ref_len), ~4 x ref_len   (less data to transfer)
Search         O(qry_len)                        O(qry_len x log ref_len)
Post-process   O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)
Impact 1: Reduced communication
27
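To show why the suffix array makes this trade, a compact host-side sketch (plain C++ that also compiles under nvcc; illustrative, not the SC'10 implementation): the index is just the sorted suffix start positions, roughly 4 bytes per reference symbol, and exact lookup is a binary search over them, costing O(qry_len x log ref_len) character comparisons.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Build the suffix array of `ref` naively: sort all suffix start positions.
// (Production code would use an O(n log n) or O(n) construction.)
std::vector<int> buildSuffixArray(const std::string &ref) {
    std::vector<int> sa(ref.size());
    for (size_t i = 0; i < ref.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
    });
    return sa;                       // 4 bytes per reference symbol vs. ~20 for a suffix tree
}

// Binary search for an exact occurrence of `qry`: O(qry_len * log ref_len).
int findMatch(const std::string &ref, const std::vector<int> &sa, const std::string &qry) {
    int lo = 0, hi = (int)sa.size() - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = ref.compare(sa[mid], qry.size(), qry);   // compare suffix prefix with query
        if (cmp == 0) return sa[mid];                      // query occurs at this position
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main() {
    std::string ref = "TACACA$";                           // example string from the slide
    std::vector<int> sa = buildSuffixArray(ref);
    printf("CA found at %d\n", findMatch(ref, sa, "CA"));  // prints 2 (the other occurrence is at 4)
    return 0;
}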
A better matching data structure
[Same suffix tree vs. suffix array comparison as on the previous slide]
Space for longer sub-references => fewer processing rounds
Impact 2: Better data locality is achieved at the cost of additional per-thread processing time
28
A better matching data structure
[Same suffix tree vs. suffix array comparison as on the previous slide]
Impact 3: Lower post-processing overhead
29
Evaluation
30
Evaluation setup
• Testbed
  – Low-end: GeForce 9800 GX2 GPU (512MB)
  – High-end: Tesla C1060 (4GB)
• Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
• Success metrics
  – Performance
  – Energy consumption
• Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)
Workload / Species             Reference sequence length   # of queries   Average read length
HS1  - Human (chromosome 2)    ~238M                       ~78M           ~200
HS2  - Human (chromosome 3)    ~100M                       ~2M            ~700
MONO - L. monocytogenes        ~3M                         ~6M            ~120
SUIS - S. suis                 ~2M                         ~26M           ~36
31
Speedup: array-based over tree-based
32
Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M ref. length, on the GeForce 9800 GX2
33
Comparing with CPU performance
[Figure: comparison with CPU performance (baseline: single-core CPU); series labeled 'Suffix tree', 'Suffix tree', and 'Suffix array']
34
Summary
• GPUs have drastically different performance characteristics
• Reconsidering the choice of data structure is necessary when porting applications to the GPU
• A good matching data structure ensures:
  – Low communication overhead
  – Data locality (possibly achieved at the cost of additional per-thread processing time)
  – Low post-processing overhead
35
Code, benchmarks, and papers available at: netsyslab.ece.ubc.ca
36