Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan

Larrabee
• Intel’s new approach to a GPU
• Considered to be a hybrid between a multi-core CPU and a GPU
• Combines functions of a multi-core CPU with the functions of a GPU
Larrabee
FETCH
Fetch
• Utilizes a hardware prefetcher
• Supports four threads of execution
– Separate register files for each thread
– Switches threads to cover cases where the compiler is unable to schedule code without stalls, or where the prefetcher has not received new instructions
– Inactive thread data is written to the core’s local L2 cache
Larrabee
PIPELINE ORGANIZATION
Pipeline
• Pipeline derived from the dual-issue Pentium processor, which has 5 stages
– Short, inexpensive execution pipeline
• Pairing rules for primary and secondary instruction pipes are deterministic
– Allows compilers to perform offline analysis with a wide scope
Pipeline
• All instructions can be issued on the primary pipeline
– Minimizes the combinatorial problems for a compiler
• Secondary pipeline can execute a large subset of the x86 instruction set
– Small and cheap
– Power wasted by failing to dual-issue on every cycle is minimal
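Because the pairing rules are deterministic, a compiler can count issue cycles offline. The sketch below illustrates that idea only. The set of operation kinds accepted by the secondary pipe (SECONDARY_OK) is a hypothetical placeholder, and data dependencies between paired instructions are ignored; neither detail comes from the slides.

```python
# Hypothetical operation kinds the secondary pipe can execute (an assumption).
SECONDARY_OK = {"load", "store", "alu", "branch"}

def count_issue_cycles(ops):
    """Count cycles for an in-order, dual-issue stream of operation kinds.

    Each cycle issues one instruction on the primary pipe. The next
    instruction is paired on the secondary pipe only if its kind is in
    SECONDARY_OK. Data dependencies are deliberately ignored here.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i + 1] in SECONDARY_OK:
            i += 2  # dual-issue: primary + secondary
        else:
            i += 1  # single-issue: primary only
        cycles += 1
    return cycles

stream = ["vector", "load", "vector", "vector", "alu"]
print(count_issue_cycles(stream))  # 3
```

The same stream always yields the same count, which is what lets the analysis run at compile time rather than in hardware.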
Pipeline
• Each core has its own pipeline
– Based upon the 5-stage Pentium
– Dual-issues instructions
– In-order execution
• Pipeline is shared between threads
– Hardware can switch between threads that have instructions ready to execute
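Thread switching among the four hardware threads can be pictured with a small selection routine. This is a sketch only: the slides do not state the selection policy, so round-robin is an assumption, and pick_next_thread is a hypothetical name.

```python
def pick_next_thread(ready, last):
    """Return the next ready thread after `last` in round-robin order.

    `ready` holds one boolean per hardware thread. Returns None when
    every thread is stalled. Round-robin is an assumed policy.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None

# Four hardware threads; threads 1 and 2 are stalled.
ready = [True, False, False, True]
print(pick_next_thread(ready, last=0))  # 3
```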
Pipeline
• Software-rendering pipeline designed to minimize the number of locks and other synchronization events
• Graphics-rendering pipeline written with high-level languages and tools
– Enables developers to add innovative rendering capabilities
Larrabee
SIMD ORGANIZATION
Vector Processor Unit
• 16-wide vector processor unit (VPU)
– Executes integer, single-precision float, and double-precision float instructions
– VPU and its registers are approximately one-third the area of the processor core
• Tradeoff
– Increased computational density
– Wider VPUs are harder to keep highly utilized
Vector Processor Unit
• VPU instructions can be predicated by a mask register
• Mask controls which parts of a vector register or memory location are written and which are left untouched
• Advantages
– Reduces branch misprediction penalties
– Gives instruction scheduler greater freedom
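Mask predication can be modeled in a few lines. The sketch below treats a vector register as a 16-element list and a mask register as 16 booleans; masked_write is a hypothetical helper, not a Larrabee instruction. It shows how a per-lane condition is handled without a branch.

```python
VPU_WIDTH = 16  # lanes, per the 16-wide VPU described above

def masked_write(dest, result, mask):
    """Write result lanes into dest where mask is set; keep dest elsewhere."""
    return [r if m else d for d, r, m in zip(dest, result, mask)]

# Per-lane "if x < 0: x = -x" expressed as a masked write instead of a branch.
x = list(range(-8, 8))            # 16 lanes
mask = [v < 0 for v in x]         # lanes that take the "if" side
negated = [-v for v in x]         # computed for every lane
out = masked_write(x, negated, mask)
print(out)  # [8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7]
```

Every lane executes the same instruction stream; the mask alone decides which results are kept.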
Number of Cores
• Many-core processor
– Planned to have 24 to 48 cores
Larrabee
SYSTEM ON-CHIP COMPONENTS
System On-Chip Components
• x86 compute cores: dual-issue, in-order processors that support the x86 instruction set with Larrabee extensions. Each is connected to the ring network and has a high-bandwidth connection to its adjacent L2 cache subset.
System On-Chip Components
• L2 cache subsets
– High-bandwidth access to the adjacent CPU core
– Connected directly to the ring network
– Coherent cache; uses the ring network to check coherency when allocating new cache lines
System On-Chip Components
• Ring network nodes
– Simple bi-directional routers with a 512-bit data path in each direction (1024 bits in total)
– Organized in rings of 8-16 cores and other devices
– Interconnected with other rings
– All data moved between cores and fixed-function units passes through the ring network
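On a bi-directional ring, a message can travel either way around. The sketch below picks the direction with fewer hops. Shortest-direction routing is an assumption made for illustration, and ring_route is a hypothetical name; the slides only state that the routers are bi-directional.

```python
def ring_route(src, dst, n_nodes):
    """Return (direction, hops) for the shorter way around a ring.

    Nodes are numbered 0..n_nodes-1. Ties go clockwise (an arbitrary choice).
    """
    clockwise = (dst - src) % n_nodes
    counterclockwise = (src - dst) % n_nodes
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)

# A 16-node ring, the upper end of the ring size given above.
print(ring_route(2, 5, 16))   # ('clockwise', 3)
print(ring_route(1, 14, 16))  # ('counterclockwise', 3)
```

With two directions available, the worst-case distance on an n-node ring is n // 2 hops instead of n - 1.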
System On-Chip Components
• Fixed-function logic components
– Provide rasterization, interpolation, and other commonly needed functions
– Directly connected to the ring network
– Will be spread among the cores to provide lower latency and load balancing on the ring network
System On-Chip Components
• Memory & I/O interface
– Provides and manages communication between the ring network and off-chip devices
– Manages initial routing and tasking of cores
Larrabee
MEMORY HIERARCHY
Memory Interface
Larrabee
ON-CHIP INTERCONNECT
On-Chip Interconnect
• Ring interconnect bus
• Similar to the Sony Cell processor.
Ring Bus
Ring Bus Features
• Bi-directional
• 512 bits in each direction
• Presumably running at core speed
• Each element can take from one direction on odd clock cycles and from the other direction on even clock cycles
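The alternating acceptance rule can be sketched as a function of the clock cycle number. Which direction maps to even cycles is not given in the slides, so the mapping below is an assumption, and accepted_direction is a hypothetical name.

```python
def accepted_direction(cycle):
    """Direction a ring element accepts from on a given clock cycle.

    Assumption: even cycles take clockwise traffic, odd cycles take
    counterclockwise traffic. The slides only say the two alternate.
    """
    return "clockwise" if cycle % 2 == 0 else "counterclockwise"

for cycle in range(4):
    print(cycle, accepted_direction(cycle))
```

Under this rule an element never has to accept from both directions in the same cycle.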
Ring Bus Comparisons
• Compared to AMD’s R600/RV670 bus, it is half the bit-width.
• The higher clock speed of Larrabee’s bus should make up for the difference in bandwidth.
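The width-versus-clock argument is simple arithmetic: peak bandwidth is bus width times clock rate. The sketch below uses 512 bits for the narrower bus and 1024 bits for the wider one, as implied by "half the bit-width". Both clock rates are hypothetical placeholders chosen for illustration, not measured or published figures.

```python
def peak_bandwidth_gb_per_s(width_bits, clock_ghz):
    """Peak bandwidth in GB/s for a bus of the given width and clock."""
    return width_bits / 8 * clock_ghz

# Clock rates below are hypothetical, picked only to show the tradeoff.
wide_slow = peak_bandwidth_gb_per_s(1024, 0.75)
narrow_fast = peak_bandwidth_gb_per_s(512, 1.5)
print(wide_slow, narrow_fast)  # 96.0 96.0
```

A bus with half the width matches the wider bus only if it runs at twice the clock rate.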
Ring Bus Tradeoff Analysis
Pros:
• Straightforward, not complex
• Able to deliver high bandwidth
• Great performance if memory clients need high bandwidth
Cons:
• Waste of chip area if most applications don’t need high memory bandwidth
• That area could be spent elsewhere to increase performance in a different way
Larrabee
MULTITHREADING ORGANIZATION
Multithreading Organization
• Superscalar
• In-order
• Four threads of execution
• Dual issue (with a vector processing unit)
Comparison to OO Execution
# CPU cores:          2 out-of-order    10 in-order
Instruction issue:    4 per clock       2 per clock
VPU per core:         4-wide SSE        16-wide
L2 cache size:        4 MB              4 MB
Single-stream:        4 per clock       2 per clock
Vector throughput:    8 per clock       160 per clock
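The vector throughput row follows from the rows above it: cores multiplied by VPU width. A minimal check of that arithmetic, using a hypothetical helper name:

```python
def vector_throughput(cores, vpu_width):
    """Vector operations per clock across all cores."""
    return cores * vpu_width

print(vector_throughput(2, 4))    # 8   (2 out-of-order cores, 4-wide SSE)
print(vector_throughput(10, 16))  # 160 (10 in-order cores, 16-wide VPU)
```

The in-order design gives up half the single-stream issue rate (2 versus 4 per clock) in exchange for 20 times the vector throughput.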
Larrabee Vector Processor
8 per clock
Scheduling Policy
• Software-controlled
• Software-controlled scheduling makes it more flexible than a typical GPU
Software Controlled Scheduling
Pros
• Flexible: can choose the scheduler to suit the application
• Worst case won’t be as bad as with a hardware-encoded scheduling policy
Cons
• Scheduler overhead takes a bite out of performance
• Programmer overhead of selecting the correct scheduler
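Choosing a scheduler in software amounts to passing a different policy to the same task loop. The sketch below is illustrative only: the two policies, the task names, and the function names are all hypothetical and do not describe Larrabee's actual runtime.

```python
import heapq
from collections import deque

def fifo_scheduler(tasks):
    """Run tasks in arrival order."""
    queue = deque(tasks)
    order = []
    while queue:
        _priority, name = queue.popleft()
        order.append(name)
    return order

def priority_scheduler(tasks):
    """Run tasks lowest priority number first."""
    heap = list(tasks)
    heapq.heapify(heap)
    order = []
    while heap:
        _priority, name = heapq.heappop(heap)
        order.append(name)
    return order

def run(tasks, scheduler):
    """The application picks the policy; the task list is unchanged."""
    return scheduler(tasks)

tasks = [(2, "shade"), (1, "rasterize"), (3, "blend")]
print(run(tasks, fifo_scheduler))      # ['shade', 'rasterize', 'blend']
print(run(tasks, priority_scheduler))  # ['rasterize', 'shade', 'blend']
```

The extra function call and queue handling stand in for the scheduler overhead listed under Cons.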
Criticism
• NVIDIA
– “like a GPU from 2006”
– Unrealistic performance projections
– Motivated by interest to retain market share
Possible Market
• DreamWorks Animation
• Xbox / PlayStation
• Scientific research
Questions?