Stream Caching: Mechanisms for General Purpose Stream Processing
Nat Duca, Jonathan Cohen (Johns Hopkins University), Peter Kirchner (IBM Research)

Talk Outline
● Objective: reconcile current practices of CPU design with stream processing theory
● Part 1: Streaming ideas in current architectures
  – Latency and die space
  – Processor types and tricks
● Part 2: Insights about stream caches
  – Could window-based streaming be the next step in computer architecture?

Streaming Architectures
● Graphics processors
● Signal processors
● Network processors
● Scalar/superscalar processors
● Data stream processors?
● Software architectures?

What is a Streaming Computer?
● Two [overlapping] ideas
  – A system that executes strict-streaming algorithms [unbounded N, small M]
  – A general purpose system that is geared toward general computation, but is best for the streaming case
● Big motivator: ALU-bound computation!
● To what extent do present computer architectures serve these two views of a streaming computer?

[Super]scalar Architectures
● Goal: keep memory latency from limiting computation speed
● Solutions:
  – Caches
  – Pipelining
  – Prefetching
  – Eager execution / branch prediction [the "super" in superscalar]
● These are heuristics that locate streaming patterns in unstructured program behavior

By the Numbers: Data
● Optimized using caches, pipelines, and eager execution
  – Random: 182 MB/s
  – Sequential: 315 MB/s
● Optimized with prefetching
  – Random: 490 MB/s
  – Sequential: 516 MB/s
● Theoretical maximum: 533 MB/s

By the Numbers: Observations
● Achieving full throughput on a scalar CPU requires either
  – (a) prefetching [requires advance knowledge], or
  – (b) sequential access [no advance knowledge required]
● Vector architectures hide latency in their instruction set using implicit prefetching
● Dataflow machines solve latency using automatic prefetching
● Rule 1: Sequential I/O simplifies control and access to memory

Superscalar (e.g. P4)
[Diagram: local memory hierarchy → cache → prefetch → ALU]
● The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating point ALU.

Pure Streaming (e.g. Imagine)
[Diagram: in streams → processor → out streams]

Can We Build This Machine?
[Diagram: in streams → processor + local memory hierarchy → out streams]
● Rule 2: A small memory footprint allows more room for ALU --> more throughput

Part II: Chromium
● Pure stream processing model
● Deals with the OpenGL command stream
  – Begin(Triangles); Vertex, Vertex, Vertex; End;
● Record splits are supported; joins are not
● You perform useful computation in Chromium by joining together stream processors into a DAG
  – Note: the DAG is constructed across multiple processors (unlike dataflow)

Chromium w/ Stream Caches
● We added join capability to Chromium for the purpose of collapsing multiple records into one
  – Incidentally: this allows windowed computations
● Thought: there seems to be a direct connection between streaming joins and sliding windows
● Because we're in software, the windows can become quite big without too much hassle
● What if we move to hardware?

Windowed Streaming
[Diagram: in streams → window buffer → out streams]
● Uses for a window buffer of size M:
  – Store program structures of up to size M
  – Cache M input records, where M << N

Windowed Streaming
[Diagram: in streams → window buffer → out streams]
● Realistic values of M if you stay exclusively on chip: 128K ... 256K ... 2MB [DRAM-on-chip tech is promising]

Impact of Window Size
[Diagram: in streams → window buffer → out streams]
● Insight: As M increases, this starts to resemble a superscalar computer

The Continuum Architecture
[Diagram: in streams → memory hierarchy → out streams]
● For too large a value of M:
  – Non-sequential I/O --> caches
  – Caches --> less room for ALU (etc.)

Windowed Streaming
[Diagram: in streams → window buffer → out streams, with loopback streams]
● Thought: Can we augment the window-buffer limit with a loopback feature?
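The window-buffer idea above can be sketched in software. The following is a minimal illustration, not the system described in the talk: the function name and the sum-based collapse are invented here. It keeps only O(M) state while consuming an unbounded stream, collapsing each full window of M records into one output record, as the Chromium stream-cache join does.

```python
from collections import deque

def windowed_collapse(stream, m, collapse=sum):
    """Consume an unbounded input stream with only O(m) state:
    buffer at most m records, then emit one collapsed record
    per full window (an N:1 join over the window contents)."""
    window = deque(maxlen=m)
    for record in stream:
        window.append(record)
        if len(window) == m:
            yield collapse(window)  # m input records -> 1 output record
            window.clear()

# Example: collapse every 4 records of a stream into their sum.
out = list(windowed_collapse(range(8), m=4))  # -> [6, 22]
```

Note that the memory footprint is fixed by M, independent of the stream length N, which is exactly the property that leaves die area free for ALUs in the hardware version.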
Windowed Streaming
[Diagram: in streams → window buffer → out streams, with loopback streams through memory]
● Thought: What do we gain by allowing a finite delay in the loopback stream?

Streaming Networks: Primitive
[Diagram]

Streaming Networks: 1:N [Hanrahan model]
[Diagram]

Streaming Networks: N:1
[Diagram]

Streaming Networks: The Ugly
[Diagram]

Versatility of Streaming Networks?
● Question: What algorithms can we support here? How?
  – Both from a theoretical and a practical view
● We have experimented with graphics problems only:
  – Stream compression, visibility & culling, level of detail

New Concepts with Streaming Networks
● An individual processor's cost is small
● Highly flexible: use high-level ideas from dataflow
  – Multiple streams in and out
  – Interleaved or non-interleaved
  – Scalable window size
● Open to entirely new concepts
  – E.g., how do you add more memory in this system?

Summary
● Systems are easily built on the basis of streaming I/O and memory models
● By design, such a system makes maximum use of hardware: very efficient
● Continuum of architectures: pure streaming to superscalar
● Stream processors are trivially chained, even in cycles
● Such a chained architecture may be highly flexible:
  – Experimental evidence & systems work
  – Dataflow literature
  – Streaming literature
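The streaming-network shapes above (1:N splits, N:1 joins, chained processors forming a DAG) can be mimicked with generators. This is an illustrative sketch only, with invented names; a real stream machine would run the stages on separate processors, whereas here they are lazily chained in one process.

```python
import itertools

def source(n):
    """A finite stand-in for an unbounded input stream."""
    yield from range(n)

def split(stream, k):
    """1:N split: round-robin the input across k output streams."""
    branches = itertools.tee(stream, k)
    return [itertools.islice(b, i, None, k) for i, b in enumerate(branches)]

def scale(stream, factor):
    """A simple per-record stream processor (no window needed)."""
    for x in stream:
        yield x * factor

def join(*streams):
    """N:1 join: interleave one record from each input stream in turn."""
    for records in zip(*streams):
        yield from records

# A tiny DAG: source -> 1:2 split -> two processors -> 2:1 join.
a, b = split(source(6), 2)
out = list(join(scale(a, 10), scale(b, 100)))  # -> [0, 100, 20, 300, 40, 500]
```

Because each stage holds only a constant number of records at a time, chaining stages composes freely, which is the property the slides exploit when building Chromium DAGs across multiple processors.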