Stream Caching: A Mechanism to Support Multi-Record Computations within Stream Processing Architectures

Stream Caching:
Mechanisms for General Purpose Stream Processing
Nat Duca
Jonathan Cohen
Johns Hopkins University
Peter Kirchner
IBM Research
Talk Outline
● Objective: reconcile current practices of CPU design with stream processing theory
● Part 1: Streaming ideas in current architectures
  – Latency and die space
  – Processor types and tricks
● Part 2: Insights about stream caches
  – Could window-based streaming be the next step in computer architecture?
Streaming Architectures
● Graphics processors
● Signal processors
● Network processors
● Scalar/superscalar processors
● Data stream processors?
● Software architectures?
What is a Streaming Computer?
● Two [overlapping] ideas
  – A system that executes strict-streaming algorithms [unbounded N, small M] (sketched below)
  – A general-purpose system geared toward arbitrary computation, but at its best in the streaming case
● Big motivator: ALU-bound computation!
● To what extent do present computer architectures serve these two views of a streaming computer?
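
To make the first idea concrete, here is a minimal sketch of a strict-streaming kernel in C++: one pass over N records with O(1) working state. The Sample type and the max-reduction are illustrative stand-ins, not code from this work.

    #include <cstddef>

    // A strict-streaming kernel in the sense above: one pass over N
    // records, constant working state (M = 1 accumulator here).
    struct Sample { float value; };

    float stream_max(const Sample* in, std::size_t n) {
        float best = -1e30f;                 // state independent of N
        for (std::size_t i = 0; i < n; ++i)
            if (in[i].value > best)
                best = in[i].value;          // each record touched once
        return best;
    }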
[Super]scalar Architectures
● Keep memory latency from limiting computation speed
● Solutions:
  – Caches
  – Pipelining
  – Prefetching
  – Eager execution / branch prediction [the "super" in superscalar]
● These are heuristics to locate streaming patterns in unstructured program behavior
By the Numbers, Data
● Optimized using caches, pipelines, and eager execution:
  – Random: 182 MB/s
  – Sequential: 315 MB/s
● Optimized with prefetching:
  – Random: 490 MB/s
  – Sequential: 516 MB/s
● Theoretical maximum: 533 MB/s
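
For reference, a sketch of the kind of microbenchmark that exposes this sequential/random gap: read a buffer much larger than cache, once in order and once through a shuffled index. Buffer size and timing method are our assumptions, not the setup behind the numbers above.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = (64u << 20) / sizeof(int);  // 64 MB of ints
        std::vector<int> data(n, 1);
        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), std::size_t{0});
        std::shuffle(idx.begin(), idx.end(), std::mt19937_64{42});

        auto bench = [&](bool random, const char* name) {
            long long sum = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < n; ++i)
                sum += data[random ? idx[i] : i];   // same work, two patterns
            double s = std::chrono::duration<double>(
                           std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: %.0f MB/s (checksum %lld)\n",
                        name, n * sizeof(int) / s / 1e6, sum);
        };
        bench(false, "sequential");
        bench(true, "random");
    }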
By the Numbers, Observations
● Achieving full throughput on a scalar CPU requires either
  – (a) prefetching [requires advance knowledge], or
  – (b) sequential access [no advance knowledge required]
● Vector architectures hide latency in their instruction set using implicit prefetching
● Dataflow machines solve latency using automatic prefetching
● Rule 1: Sequential I/O simplifies control and access to memory, etc.
Superscalar (e.g. P4)
[Figure: local memory hierarchy, with cache and prefetch logic between memory and the processor]
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating-point ALU.
Pure Streaming (e.g. Imagine)
[Figure: in streams -> processor -> out streams, with no memory hierarchy]
Can We Build This Machine?
[Figure: in streams -> processor with a local memory hierarchy -> out streams]
● Rule 2: Small memory footprint allows more room for ALU --> more throughput
Part II: Chromium
● Pure stream processing model
● Deals with the OpenGL command stream (sketched below)
  – Begin(Triangles); Vertex, Vertex, Vertex; End;
● Record splits are supported; joins are not
● You perform useful computation in Chromium by joining stream processors together into a DAG
  – Note: the DAG is constructed across multiple processors (unlike dataflow)
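
A record in this stream is one Begin/End bracket. A minimal C++ sketch of that record model with a pass-through stream processor follows; the Cmd/Record types and the consume/downstream interface are hypothetical illustrations, not Chromium's actual SPU API.

    #include <functional>
    #include <vector>

    // Hypothetical command-stream types; Chromium's real interface differs.
    enum class Op { Begin, Vertex, End };
    struct Cmd { Op op; float x = 0, y = 0, z = 0; };
    using Record = std::vector<Cmd>;      // one Begin ... End bracket

    // A stream processor consumes records and emits records downstream
    // (downstream is assumed to be wired up when the DAG is built).
    // Splitting a record means emitting several smaller Begin/End brackets
    // for one input; joining would do the reverse.
    struct StreamProcessor {
        std::function<void(const Record&)> downstream;
        virtual void consume(const Record& r) { downstream(r); } // pass-through
        virtual ~StreamProcessor() = default;
    };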
Chromium w/ Stream Caches
● We added join capability to Chromium for the purpose of collapsing multiple records into one (see the sketch below)
  – Incidentally: this allows windowed computations
● Thought: there seems to be a direct connection between streaming joins and sliding windows
● Because we're in software, the windows can become quite big without too much hassle
● What if we move to hardware?
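
Building on the types above, a minimal sketch of such a join: a stream cache that buffers up to M records and collapses them into one output record. The tumbling window and concatenating merge are simplifying assumptions; a true sliding window would retire one record per arrival.

    #include <cstddef>
    #include <vector>

    // Stream cache sketch: collapse M input records into one output record.
    struct StreamCache : StreamProcessor {
        std::size_t window;               // M, the window size
        std::vector<Record> buffered;

        explicit StreamCache(std::size_t m) : window(m) {}

        void consume(const Record& r) override {
            buffered.push_back(r);
            if (buffered.size() == window) {  // window full: join and emit
                Record joined;                // many records in, one out
                for (const Record& b : buffered)
                    joined.insert(joined.end(), b.begin(), b.end());
                buffered.clear();
                downstream(joined);
            }
        }
    };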
Windowed Streaming
[Figure: in streams -> window buffer -> out streams]
Uses for a window buffer of size M:
● Store program structures of up to size M
● Cache M input records, where M << N
Windowed Streaming
[Figure: in streams -> window buffer -> out streams]
Realistic values of M if you stay exclusively on chip: 128 KB ... 256 KB ... 2 MB [DRAM-on-chip technology is promising]
Impact on Window Size
[Figure: in streams -> window buffer -> out streams]
Insight: As M increases, this starts to resemble a superscalar computer.
The Continuum Architecture
[Figure: in streams -> memory hierarchy -> out streams]
● For too large a value of M:
  – Non-sequential I/O --> caches
  – Caches --> less room for ALU (etc.)
Windowed Streaming
[Figure: in streams -> window buffer -> out streams, with loopback streams]
Thought: Can we get past the window-buffer limit with a loopback feature?
Windowed Streaming
[Figure: in streams -> window buffer -> out streams, with loopback streams routed through memory]
Thought: What do we gain by allowing a finite delay in the loopback stream?
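
One reading of that question, in terms of the earlier sketch: the loopback path acts as a FIFO of depth D, so each new record can be combined with output that left the processor D records earlier, extending the effective window beyond the buffer itself. The depth and combination semantics here are our assumptions; the merge step is left abstract.

    #include <cstddef>
    #include <deque>

    // Loopback sketch: emitted records also re-enter through a FIFO of
    // fixed depth D (the "finite delay"), so new input meets old output.
    struct LoopbackProcessor : StreamProcessor {
        std::size_t delay;            // D: records in flight on the loopback
        std::deque<Record> loop;

        explicit LoopbackProcessor(std::size_t d) : delay(d) {}

        void consume(const Record& r) override {
            if (loop.size() == delay) {
                Record old = loop.front();   // delayed record re-enters here
                loop.pop_front();
                (void)old;                   // ... combine old with r as the
                                             //     kernel requires ...
            }
            loop.push_back(r);               // output also travels the loop
            downstream(r);
        }
    };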
Streaming Networks: Primitive
[Figure: a single stream processor]
Streaming Networks: 1:N [Hanrahan model]
[Figure: one stream fanning out to N processors]
Streaming Networks: N:1
[Figure: N streams joining into one processor]
Streaming Networks: The Ugly
[Figure: an arbitrary network of stream processors]
Versatility of Streaming Networks?
● Question: What algorithms can we support here? How?
  – Both from a theoretical and a practical view
● We have experimented with graphics problems only:
  – Stream compression, visibility & culling, level of detail (culling is sketched below)
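
As one concrete case from that list, visibility culling maps cleanly onto the record model: it is a predicate applied to one record at a time. A sketch using the earlier types; the single half-space test stands in for a full view-frustum test, and the structure is ours, not from the talk.

    #include <array>

    // Culling as a stream kernel: drop a record (one Begin/End bracket)
    // whose vertices all fall outside the plane a*x + b*y + c*z + d >= 0.
    struct CullProcessor : StreamProcessor {
        std::array<float, 4> plane{0, 0, 1, 0};  // assumed test plane (a,b,c,d)

        void consume(const Record& r) override {
            for (const Cmd& c : r)
                if (c.op == Op::Vertex &&
                    plane[0]*c.x + plane[1]*c.y + plane[2]*c.z + plane[3] >= 0) {
                    downstream(r);        // some vertex inside: keep record
                    return;
                }
            // every vertex outside the plane: record culled, nothing emitted
        }
    };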
New Concepts with Streaming Networks
● An individual processor's cost is small
● Highly flexible: uses the high-level ideas of dataflow
  – Multiple streams in and out
  – Interleaved or non-interleaved
  – Scalable window size
● Open to entirely new concepts
  – E.g., how do you add more memory in this system?
Summary
● Systems are easily built on the basis of streaming I/O and memory models
● By design, such a system makes maximum use of its hardware: very efficient
● Continuum of architectures: pure streaming to superscalar
● Stream processors are trivially chained, even in cycles
● Such a chained architecture may be highly flexible:
  – Experimental evidence & systems work
  – Dataflow literature
  – Streaming literature