Trumping the
Multicore Memory Hierarchy
with Hi-Spade
Phillip B. Gibbons
Intel Labs Pittsburgh
April 30, 2010
Keynote talk at 10th SIAM International Conference on Data Mining
Hi-Spade: Outline / Take-Aways
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy
algorithms & systems
• Smart thread schedulers enable simple,
hierarchy-savvy abstractions
• Flash-savvy (database) systems
maximize benefits of Flash devices
• Ongoing work w/ many open problems
For Good Performance,
Must Use the Hierarchy Effectively
Performance: running/response time, throughput, power.
Hierarchy (figure): CPU, L1 cache, L2 cache, and main memory (memory), backed by magnetic disks (storage).
Data-intensive applications stress the hierarchy
Clear Trend: Hierarchy Getting Richer
• More levels of cache
• Pervasive Multicore
• New memory / storage technologies
– E.g., Pervasive use of Flash
These emerging hierarchies bring both
new challenges & new opportunities
New Trend: Pervasive Multicore
Opportunity:
• Rethink apps & systems to take advantage of more CPUs on chip
Challenges:
• Cores compete for the hierarchy
• Hard to reason about parallel performance
• Hundreds of cores coming soon
• Cache hierarchy design in flux
• Hierarchies differ across platforms
(Figure: multiple CPUs with private L1 caches; some cores share an L2 cache; main memory and magnetic disks below.)
Much Harder to Use Hierarchy Effectively
New Trend: Pervasive Flash
Opportunity:
• Rethink apps & systems to take advantage
Challenges:
• Performance quirks of Flash
• Technology in flux, e.g., the Flash Translation Layer (FTL)
(Figure: flash devices join the hierarchy alongside main memory and magnetic disks, below the CPUs, private L1s, and shared L2 cache.)
New Type of Storage in the Hierarchy
E.g., Xeon 7500 Series MP Platform
• 4 sockets, each with 8 cores; 2 HW threads per core
• Per core: 32KB L1 and 256KB L2 caches
• Per socket: 24MB shared L3 cache
• Up to 1 TB main memory
• Attach: magnetic disks & flash devices
How Hierarchy is Treated Today
Algorithm designers & application/system developers tend toward one of two extremes:
• Ignorant: API view is memory + I/O; parallelism often ignored; performance iffy
• (Pain)fully aware: hand-tuned to the platform; effort high, not portable, limited sharing scenarios
Or they focus on one or a few aspects, but without a comprehensive view of the whole.
From SDM’10 Call for Papers
“Extracting knowledge requires the use of sophisticated, high-performance and principled analysis techniques and algorithms, based on sound theoretical and statistical foundations. These techniques in turn require powerful visualization technologies; implementations that must be carefully tuned for performance; software systems that are usable by scientists, engineers, and physicians as well as researchers; and infrastructures that support them.”
Hierarchy-Savvy parallel algorithm
design (Hi-Spade) project
…seeks to enable:
A hierarchy-savvy approach to algorithm design
& systems for emerging parallel hierarchies
"Hierarchy-savvy" means:
• Ignore what can be ignored
• Focus on what must be exposed for good performance
• Robust across many platforms & resource-sharing scenarios
• Sweet spot between ignorant and (pain)fully aware
http://www.pittsburgh.intel-research.net/projects/hi-spade/
Hierarchy-Savvy Sweet Spot
(Figure: performance vs. programming effort, comparing the Ignorant, Hierarchy-Savvy, and (Pain)fully Aware approaches on Platform 1 and Platform 2.)
Modest effort, good performance, robust
Hi-Spade Research Scope
A hierarchy-savvy approach to algorithm design
& systems for emerging parallel hierarchies
Agenda: Create abstractions, tools & techniques that
• Assist programmers & algorithm designers in
achieving effective use of emerging hierarchies
• Lead to systems that better leverage the new
capabilities these hierarchies provide
Theory / Systems / Applications
Hi-Spade Collaborators
• Intel Labs Pittsburgh: Shimin Chen (co-PI)
• Carnegie Mellon: Guy Blelloch, Jeremy Fineman,
Robert Harper, Ryan Johnson, Ippokratis Pandis,
Harsha Vardhan Simhadri, Daniel Spoonhower
• Microsoft Research: Suman Nath
• EPFL: Anastasia Ailamaki, Manos Athanassoulis,
Radu Stoica
• University of Pittsburgh: Panos Chrysanthis,
Alexandros Labrinidis, Mohamed Sharaf
Hi-Spade: Outline
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy
algorithms & systems
• Smart thread schedulers enable simple,
hierarchy-savvy abstractions
• Flash-savvy (database) systems
maximize benefits of Flash devices
• Ongoing work w/ many open problems
Abstract Hierarchy: Target Platform
Specific example: the Xeon 7500 platform above.
General abstraction: a tree of caches (figure: caches nested level by level, fanning out to the cores at the leaves).
Abstract Hierarchy: Simplified View
What yields good hierarchy performance?
• Spatial locality: use what’s brought in
– Popular sizes: Cache lines 64B; Pages 4KB
• Temporal locality: reuse it
• Constructive sharing: don’t step on others’ toes
How might one simplify the view?
• Approach 1: Design to a 2 or 3 level hierarchy (?)
• Approach 2: Design to a sequential hierarchy (?)
• Approach 3: Do both (??)
Sequential Hierarchies: Simplified View
• External Memory Model; see [J. S. Vitter, ACM Computing Surveys, 2001]
– Two levels: a main memory of size M and an external memory, accessed in blocks of size B
– Goal: minimize I/Os
– Pros: simple model. Cons: only 2 levels, only 1 "cache"
• Can be a good choice if the bottleneck is the last level
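For concreteness, the standard bounds from the EM literature for scanning and sorting N items (these formulas are from Vitter's survey, not this slide) are:

\[ \mathrm{scan}(N) = \Theta\!\big(\lceil N/B \rceil\big), \qquad \mathrm{sort}(N) = \Theta\!\Big(\tfrac{N}{B}\,\log_{M/B}\tfrac{N}{B}\Big) \]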
Sequential Hierarchies: Simplified View
• Ideal Cache Model [Frigo et al., FOCS'99]
– Twist on the EM model: M & B are unknown to the algorithm
– Key algorithm goal: good performance for any M & B
– Pros: simple model; guaranteed good cache performance at all levels of the hierarchy; encourages hierarchical locality
– Cons: single CPU only (all caches shared)
Example Paradigms Achieving Key Goal
• Scan: e.g., computing the sum of N items: N/B misses, for any B (optimal)
• Divide-and-Conquer: e.g., matrix multiply C = A*B
– Divide: recursively compute the quadrant products A11*B11, …, A22*B22, where
C11 = A11*B11 + A12*B21, C12 = A11*B12 + A12*B22,
C21 = A21*B11 + A22*B21, C22 = A21*B12 + A22*B22
– Conquer: compute the 4 quadrant sums
– Uses a recursive Z-order layout
– O(N²/B + N³/(B·√M)) misses (optimal)
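As an illustration, here is a minimal Python sketch of the quadrant recursion (my own sketch, not code from the talk); plain nested lists stand in for the recursive Z-order layout, so it shows the recursion structure rather than the actual cache behavior:

def matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Add the n-by-n block A[ai.., aj..] * B[bi.., bj..] into C[ci.., cj..]."""
    if n == 1:                    # base case: a single scalar multiply-add
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2                    # split every matrix into 2x2 quadrants and recurse
    for i in (0, h):              # quadrant row of C
        for j in (0, h):          # quadrant column of C
            for k in (0, h):      # inner dimension
                matmul_add(A, B, C, ai + i, aj + k, bi + k, bj + j, ci + i, cj + j, h)

def matmul(A, B):
    n = len(A)                    # n is assumed to be a power of two in this sketch
    C = [[0] * n for _ in range(n)]
    matmul_add(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]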
Multicore Hierarchies: Possible Views
Design to the Tree-of-Caches abstraction:
• Multi-BSP Model [L. G. Valiant, ESA'08]
– 4 parameters per level: cache size, fanout, latency/sync cost, transfer bandwidth
– Bulk-synchronous
Our Goal:
• Approach simplicity of Ideal Cache Model
– Hierarchy-Savvy sweet spot
– Do not require bulk-synchrony
Multicore Hierarchies: Key Challenge
• The theory underlying the Ideal Cache Model falls apart once parallelism is introduced:
Good performance for any M & B on 2 levels
DOES NOT
imply good performance at all levels of hierarchy
Key reason: caches are not fully shared.
(Figure: CPU1, CPU2, and CPU3 each have a private L1; CPU1 and CPU2 share an L2 cache, while CPU3 has its own L2; all want to write the same block B.)
What's good for CPU1 is often bad for CPU2 & CPU3: e.g., all want to write B at ≈ the same time.
Multicore Hierarchies
Key New Dimension: Scheduling
Key new dimension: the scheduling of parallel threads has a LARGE impact on cache performance.
Recall our problem scenario: caches are not fully shared, and all CPUs want to write B at ≈ the same time.
(Figure: the same three CPUs with private L1s and partially shared L2 caches, all writing block B.)
Can mitigate (but not solve) this if we can schedule the writes to be far apart in time.
Key Enabler: Fine-Grained Threading
• Coarse Threading popular for decades
– Spawn one thread per core at program initialization
– Heavy-weight O.S. threads
– E.g., Splash Benchmark
• Better alternative: fine-grained threading
– The system supports user-level light-weight threads
– Programs expose lots of parallelism
– Dynamic parallelism: forking can be data-dependent
– A smart runtime scheduler maps threads to cores, dynamically as the computation proceeds (see the sketch below)
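To make the programming model concrete, here is a toy Python sketch (my own, not the talk's runtime) of dynamic fork-join parallelism on a pool of worker threads; a real system would use light-weight user-level threads with a smart scheduler, and Python's GIL means this shows the style rather than a speedup:

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=8)    # stand-in for a smart runtime scheduler

def parallel_sum(xs, cutoff=1024, depth=3):
    """Fork-join sum: fork a subtask for the left half until the cutoff/depth,
    compute the right half in the current thread, then join.
    The depth cap keeps the task count (7) below the pool size (8), so waiting
    parents cannot exhaust the pool."""
    if depth == 0 or len(xs) <= cutoff:
        return sum(xs)                      # small task: run sequentially
    mid = len(xs) // 2
    left = pool.submit(parallel_sum, xs[:mid], cutoff, depth - 1)   # fork
    right = parallel_sum(xs[mid:], cutoff, depth - 1)
    return left.result() + right            # join

print(parallel_sum(list(range(100000))))    # 4999950000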
Cache Uses Among Multiple Threads
• Constructive: threads share a largely overlapping working set
• Destructive: threads compete for the limited on-chip cache and "flood" the off-chip pins
(Figure: processors with private L1s over an interconnect and a shared L2 cache, in a constructive and a destructive configuration.)
(Slide thanks to Shimin Chen)
Smart Thread Schedulers
• Work Stealing
– Give priority to tasks in local work queue
– Good for private caches
• Parallel Depth-first (PDF) [JACM’99, SPAA’04]
– Give priority to earliest ready tasks in
the sequential schedule
– Good for shared caches
(Figure: PDF converts sequential locality into parallel locality: a single processor with its L1 and L2 cache over main memory, versus several processors with private L1s over a shared L2 cache and main memory.)
A toy sketch of the two priority rules follows.
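As a rough illustration (my own sketch, not the schedulers from the cited papers), the following Python classes contrast the two priority rules: a PDF scheduler always dispatches the ready task that comes earliest in the sequential depth-first schedule, while a work-stealing worker prefers its most recently spawned local task and steals the oldest task from a victim when idle:

import heapq
from collections import deque

class PDFScheduler:
    """Parallel depth-first rule: dispatch the ready task that is earliest
    in the sequential (1-processor depth-first) schedule."""
    def __init__(self):
        self.ready = []                      # min-heap keyed on sequential rank
    def spawn(self, rank, task):
        heapq.heappush(self.ready, (rank, task))
    def next_task(self):
        return heapq.heappop(self.ready)[1] if self.ready else None

class WorkStealingWorker:
    """Work-stealing rule: prefer the most recently spawned local task,
    steal the oldest task from a victim only when the local deque is empty."""
    def __init__(self):
        self.local = deque()
    def spawn(self, task):
        self.local.append(task)
    def next_task(self, victims):
        if self.local:
            return self.local.pop()          # newest local task (LIFO)
        for v in victims:
            if v.local:
                return v.local.popleft()     # oldest task of a victim (FIFO)
        return None

pdf = PDFScheduler()
for rank, name in [(3, "t3"), (1, "t1"), (2, "t2")]:
    pdf.spawn(rank, name)
print(pdf.next_task())                       # 't1', the earliest ready task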
Parallel Merge Sort: WS vs. PDF
(Figure: memory-access traces of parallel merge sort on 8 cores under Work Stealing (WS) and Parallel Depth First (PDF), with accesses colored as cache miss, cache hit, or mixed. Shared cache = 0.5 * (src array size + dest array size).)
Private vs. Shared Caches
• 3-level multi-core model
• Designed new scheduler (Controlled PDF) with
provably good cache performance for class of
divide-and-conquer algorithms [SODA08]
(Figure: CPU1 and CPU2 with private L1s share an L2 cache; CPU3 has its own L1 and L2; all sit above main memory.)
The results require exposing the working-set size of each recursive subproblem.
Low-Span + Ideal Cache Model
• Observation: guarantees on cache performance depend on the computation's span S (the length of the critical path)
– E.g., work stealing on a single level of private caches:
Thrm: for any computation with fork-join parallelism, O(M·P·S/B) more misses on P cores than on 1 core
• Approach [SPAA'10]: design parallel algorithms with
– low span S, and
– good performance (a good miss bound) on the Ideal Cache Model
Thrm: for any computation with fork-join parallelism, only O(M_i·P·S/B_i) more misses at each level i than on 1 core, for a hierarchy of private caches
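Restating the two bounds in notation (my transcription of the slide's claims), with Q_P the misses on P cores and Q_1 the misses of the sequential execution:

\[ Q_P \;\le\; Q_1 + O\!\Big(\tfrac{M\,P\,S}{B}\Big) \quad \text{(one level of private caches, work stealing)}, \]
\[ Q_P^{(i)} \;\le\; Q_1^{(i)} + O\!\Big(\tfrac{M_i\,P\,S}{B_i}\Big) \quad \text{(each level } i \text{ of a private-cache hierarchy)}. \]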
Challenge of General Case
Tree-of-Caches
• Each subtree has a given amount of compute &
cache resources
• To avoid cache misses from migrating tasks,
would like to assign/pin task to a subtree
• But any given program task may not match both
– E.g., May need large cache but few processors
Hi-Spade: Outline
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy
algorithms & systems
• Smart thread schedulers enable simple,
hierarchy-savvy abstractions
• Flash-savvy (database) systems
maximize benefits of Flash devices
• Ongoing work w/ many open problems
Flash Superior to Magnetic Disk
on Many Metrics
• Energy-efficient
• Smaller
• Lighter
• More durable
• Higher throughput
• Less cooling cost
Flash-Savvy Systems
• Simply replacing some magnetic disks with
Flash devices WILL improve performance
However:
• Much of the performance left on the table
– Systems not tuned to Flash characteristics
Flash-savvy systems:
• Maximize benefits of platform’s flash devices
– What is best offloaded to flash?
Many papers in this area; we discuss only our results.
NAND Flash Chip Properties
• Organized as blocks (64-128 pages) of pages (512-2048 B)
• Read and write are per page; erase is per block
• A page can be written only once after its block is erased, so an in-place update requires: 1. copy, 2. erase, 3. write, 4. copy, 5. erase
• Expensive operations: in-place updates and random writes
(Chart: read latency is 0.4 ms sequential vs. 0.6 ms random; write latency is 0.4 ms sequential vs. 127 ms random.)
Using “Semi-Random” Writes
in place of Random Writes
(Chart: energy to maintain a random sample on a Lexar CF card, using our algorithm [VLDB'08].)
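The idea behind semi-random writes, roughly, is to pick erase blocks in arbitrary order while writing the pages inside each chosen block in ascending order, which flash handles nearly as fast as fully sequential writes. Here is an illustrative Python sketch (my own, not the VLDB'08 algorithm; the block geometry is an assumption):

from collections import defaultdict

PAGES_PER_BLOCK = 64        # assumed block geometry for this sketch

def semi_random_order(page_writes):
    """Reorder (page_number, data) writes: erase blocks in first-touch order,
    pages within each block in ascending page order."""
    by_block = defaultdict(list)
    block_order = []
    for page, data in page_writes:
        block = page // PAGES_PER_BLOCK
        if block not in by_block:           # membership test does not insert a key
            block_order.append(block)
        by_block[block].append((page, data))
    ordered = []
    for block in block_order:
        ordered.extend(sorted(by_block[block], key=lambda pw: pw[0]))
    return ordered

writes = [(130, "x"), (5, "y"), (131, "z"), (7, "w")]
print(semi_random_order(writes))   # [(130, 'x'), (131, 'z'), (5, 'y'), (7, 'w')]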
Quirks of Flash (Mostly) Hidden
by SSD Firmware
(Chart: access time (ms) vs. request size (512 B to 16 KB) on an Intel X25-M SSD, for seq-read, seq-write, ran-read, and ran-write.)
Random writes & in-place updates no longer slow
Flash Logging (1/3)
[SIGMOD’09]
Transactional logging: major bottleneck
• Today, OLTP Databases can fit into main memory
(e.g., in TPCC, 30M customers < 100GB)
• In contrast, must flush redo log to stable media
at commit time
Log access pattern: small sequential writes
• Ill-suited for magnetic disks: incur full rotational
delays
• Alternative solutions are expensive or
complicated
Exploiting flash devices for logging
(Slide thanks to Shimin Chen)
Flash Logging (2/3)
USB flash drives are a good match
• Widely available USB ports
• Inexpensive: use multiple devices for better
performance
• Hot-plug: cope with limited erase cycles
• Multiple USB flash drives achieve better
performance with lower price than a single SSD
Our solution: FlashLogging
• Unconventional array design
• Outlier detection & hiding
• Efficient recovery
(Figure: the database's in-memory log buffer feeds a request queue, which an interface dispatches to multiple worker flash drives.)
(Slide thanks to Shimin Chen)
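To convey only the flavor of the approach, here is a drastically simplified Python sketch of striping small sequential log writes round-robin across several flash drives; the mount points, file name, and round-robin policy are my own assumptions, and the real FlashLogging design adds the request queue, outlier detection & hiding, and recovery logic listed above:

import os

class StripedLogger:
    """Round-robin striping of log appends over several flash drives."""
    def __init__(self, mount_points):
        # one append-only log file per drive (paths are assumptions of this sketch)
        self.files = [open(os.path.join(p, "log.bin"), "ab") for p in mount_points]
        self.next = 0

    def commit(self, record: bytes):
        f = self.files[self.next]
        self.next = (self.next + 1) % len(self.files)
        f.write(record)
        f.flush()
        os.fsync(f.fileno())    # the record is durable before commit() returns

# logger = StripedLogger(["/mnt/usb0", "/mnt/usb1", "/mnt/usb2"])
# logger.commit(b"txn 42 committed\n")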
Flash Logging (3/3)
(Chart: new-order transactions per minute for the original disk- and ssd-based logging versus FlashLogging on usb-A, usb-B, usb-C, and ssd, with the ideal shown for reference.)
• Up to 5.7X improvements over disk based logging
• Up to 98% of ideal performance
• Multiple USB flash drives achieve better performance
than a single SSD, at fraction of the price
(Slide thanks to Shimin Chen)
PR-Join for Online Aggregation
• Data warehouse and business intelligence
– Fast growing multi-billion dollar market
• Interactive ad-hoc queries
– Important for detecting new trends
– Fast response times hard to achieve
• One promising approach: Online aggregation
– Provides early representative results
for aggregate queries (sum, avg, etc), i.e.,
estimates & statistical confidence intervals
– Problem: Queries with joins are too slow
• Our goal: A faster join for online aggregation
Design Space
(Figure: design space of early representative result rate (low to high) vs. total I/O cost (low to high), placing Ripple, SMS, Hash Ripple, and GRACE joins; PR-Join targets a high early result rate at low total I/O cost.)
(Slide thanks to Shimin Chen)
Background: Ripple Join
A join B: find matching records of A and B
(Figure: records from A along one axis and records from B along the other; each ripple covers the new records joined against both the new and the previously spilled records.)
For each ripple:
• Read new records from A and B; check them for matches
• Read spilled records; check them for matches with the new records
• Spill the new records to disk
The join checks all pairs of records from A and B.
Problem: the ripple width is limited by the memory size.
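Here is a minimal in-memory Python sketch of the ripple pattern (my own illustration, not the paper's implementation; a real ripple join reads from disk, spills, and attaches statistical estimates to the early output):

def ripple_join(A, B, key, step=2):
    """Yield matching (a, b) pairs; each round reads `step` new records from
    each input and joins them against everything seen so far."""
    seen_a, seen_b = [], []
    i = j = 0
    while i < len(A) or j < len(B):
        new_a, new_b = A[i:i + step], B[j:j + step]
        i += len(new_a)
        j += len(new_b)
        for a in new_a:                     # new A records vs. all B so far
            for b in seen_b + new_b:
                if key(a) == key(b):
                    yield (a, b)
        for b in new_b:                     # new B records vs. old A records
            for a in seen_a:
                if key(a) == key(b):
                    yield (a, b)
        seen_a.extend(new_a)
        seen_b.extend(new_b)

A = [(1, "a1"), (2, "a2"), (3, "a3")]
B = [(2, "b2"), (3, "b3"), (4, "b4")]
print(list(ripple_join(A, B, key=lambda r: r[0])))
# [((2, 'a2'), (2, 'b2')), ((3, 'a3'), (3, 'b3'))]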
Partitioned expanding Ripple Join
PR-Join Idea: Multiplicatively expanding ripples
• Higher result rate
• Representative results
To overcome ripple width > memory: hash partitioning on the join key
• Each partition fits in memory
• Report results per partitioned ripple
[Sigmod'10]
PR-Join leveraging SSD
(Chart: PR-Join achieves a higher early result rate with near-optimal total I/O cost. Setting: a 10GB input joins a 10GB input with 500MB of memory; inputs on HD, SSD used for temp storage.)
Concurrent Queries & Updates
in Data Warehouse
• Data Warehouse queries dominated
by table scans
– Sequential scan on HD
• Updates are delayed to avoid interfering
– E.g., Mixing random updates with TPCH queries
would incur 2.9X query slowdown
– Thus, queries are on stale data
Concurrent Queries & Updates
in Data Warehouse
• Our Approach: Cache updates on SSD
– Queries take updates into account on-the-fly
– Updates periodically migrated to HD in batch
– Improves query latency by 2X,
improves update throughput by 70X
(Figure: 1. incoming updates are cached on the SSD; 2. query processing merges the related updates into the table (range) scan on the fly; 3. updates are periodically migrated to the disks holding the main data.)
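A tiny Python sketch of the merge-on-the-fly step (my own illustration, not the system's code; it models the SSD-resident update cache as a dict keyed by row id and ignores inserts):

def scan_with_updates(base_rows, cached_updates):
    """base_rows: (row_id, row) pairs from the sequential scan of the main data.
    cached_updates: dict row_id -> new_row, or None for a deleted row."""
    for row_id, row in base_rows:
        if row_id in cached_updates:
            new_row = cached_updates[row_id]
            if new_row is not None:        # apply the cached update on the fly
                yield (row_id, new_row)
            # a None entry marks a delete: emit nothing for this row
        else:
            yield (row_id, row)

base = [(1, "old-a"), (2, "old-b"), (3, "old-c")]
updates = {2: "new-b", 3: None}            # row 2 updated, row 3 deleted
print(list(scan_with_updates(base, updates)))   # [(1, 'old-a'), (2, 'new-b')]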
Hi-Spade: Outline
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy
algorithms & systems
• Smart thread schedulers enable simple,
hierarchy-savvy abstractions
• Flash-savvy (database) systems
maximize benefits of Flash devices
• Ongoing work w/ many open problems
Publications (1)
Cache Hierarchy & Schedulers:
1. PDF scheduler for shared caches [SPAA'04]
2. Scheduling for constructive sharing [SPAA'07]
3. Controlled-PDF scheduler [SODA'08]
4. Combinable MBTs [SPAA'08]
5. Semantic space profiling & visualization [ICFP'08]
6. Scheduling beyond nested parallelism [SPAA'09]
7. Low-depth paradigm & algorithms [SPAA'10]
8. Parallel Ideal Cache model [under submission]
Semantic Space Profiling
(Figures: heap-use DAGs showing the peak use of matrix multiply under two schedulers, work stealing and breadth-first, plus an allocation-point breakdown on 2 cores with the breadth-first scheduler.) [ICFP'08]
Publications (2)
Flash-savvy database systems:
1. Semi-random writes [VLDB'08]
2. Flash-based transactional logging [Sigmod'09]
3. PR-Join for online aggregation [Sigmod'10]
4. I/O scheduling for transactional I/Os [under submission]
5. Concurrent warehousing queries & updates [under submission]
Many Open Problems
• Hierarchy-savvy ideal: Simplified view
+ thread scheduler that will rule the world
• New tools & architectural features that will help
• Extend beyond MP platform to cluster/cloud
• Richer class of algorithms: Data mining, etc.
• Hierarchy-savvy scheduling for power savings
• PCM-savvy systems: How will Phase Change
Memory change the world?
Hi-Spade: Conclusions
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy
algorithms & systems
• Smart thread schedulers enable simple,
hierarchy-savvy abstractions
• Flash-savvy (database) systems
maximize benefits of Flash devices
• Ongoing work w/ many open problems