Increasing Machine Throughput
Superscalar Processing
Multiprocessor Systems
CS 352 : Computer Organization and Design
University of Wisconsin-Eau Claire
Dan Ernst
Multiprocessing
• There are 3 generic ways to do multiple things “in parallel”
– Instruction-level Parallelism (ILP)
• Superscalar
– doing multiple instructions (from a single program) simultaneously
– Data-level Parallelism (DLP)
• Do a single operation over a larger chunk of data
– Vector Processing
– “SIMD Extensions” like MMX or SSE (see the sketch after this list)
– Thread-level Parallelism (TLP)
• Multiple processes
– Can be separate programs
– …or a single program broken into separate threads
• Usually used on multiple processors, but not required
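As a concrete sketch of the data-level parallelism bullet above: with the SSE extension to x86 (a successor to MMX), one instruction operates on four floats at once. The function below is illustrative only; the intrinsics map roughly one-to-one onto SIMD instructions.

    #include <xmmintrin.h>   /* x86 SSE intrinsics */

    /* Add four pairs of floats with a single SIMD add. */
    void add4(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);       /* load 4 floats from a    */
        __m128 vb = _mm_loadu_ps(b);       /* load 4 floats from b    */
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 adds */
        _mm_storeu_ps(out, vc);            /* store the 4 results     */
    }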
Superscalar Processors
• With pipelining, can you ever reduce the CPI to < 1?
• Only if you issue > 1 instruction per cycle
• Superscalar means you add extra hardware to the pipeline to allow multiple instructions (usually 2-8) to execute simultaneously in the same stage
  – This is a form of parallelism, although one never refers to a superscalar machine as a parallel machine
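For example (the numbers are purely illustrative): a 2-wide superscalar machine that sustains an average of 1.5 instructions per cycle finishes a 3,000-instruction program in about 2,000 cycles, giving a CPI of roughly 0.67, or equivalently an IPC of about 1.5.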
Dynamic Scheduling
[Figure: an in-order instruction fetch/decode and issue unit feeds reservation stations in front of the functional units (integer units, a floating-point unit, and a load/store unit); execution proceeds out of order, and a commit unit retires results in order. Figure © 1998 Morgan Kaufmann Publishers.]
• 4 (or more) reservation stations for 4 (or more) separate pipelines
  – Each pipeline may have a different depth
How to Fill Pipeline Slots
• We’ve got lots of room to execute – now how do we fill the slots?
• This process is called Scheduling
– A schedule is created, telling instructions when they can execute
• 2 (very different) ways to do this:
– Static Scheduling
• Compiler (or coder) arranges instructions into an order which can be
executed correctly
– Dynamic Scheduling
• Hardware in the processor reorders instructions at runtime to
maximize the number executing in parallel
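Whichever way the schedule is built, the raw material is the same: independent instructions. In the minimal C sketch below (the function and variable names are purely illustrative), the two accumulation chains do not depend on each other, so a compiler can interleave their instructions ahead of time, or out-of-order hardware can issue them side by side at runtime.

    /* Two independent dependence chains: a scheduler (static or dynamic)
       can freely interleave them to keep multiple pipelines busy. */
    void two_sums(const float *x, const float *y, int n, float *s0, float *s1)
    {
        float a = 0.0f, b = 0.0f;
        for (int i = 0; i < n; i++) {
            a += x[i];   /* chain 1                         */
            b += y[i];   /* chain 2: independent of chain 1 */
        }
        *s0 = a;
        *s1 = b;
    }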
Static Scheduling Case Study
• Intel’s IA-64 architecture (AKA Itanium)
– First appeared in 2001 after enormous amounts of hype
• Independent instructions are collected by the compiler into a
bundle.
– The instructions in each bundle are then executed in parallel
• Similar to previous static scheduling schemes
– However, the compiler has more control than standard VLIW machines
Dynamic Pipeline Scheduling
• Allow the hardware to make scheduling decisions
• In order issue of instructions
• Out of order execution of instructions
• In case of empty resources:
  – The hardware will look ahead in the instruction stream to see if there are any instructions that are OK to execute
• As they are fetched, instructions get placed in reservation stations – where they wait until their inputs are ready
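A tiny sketch of the opportunity the hardware is hunting for (the struct and names are invented for illustration): the load may miss in the cache and sit in its reservation station waiting for memory, while the independent add behind it executes right away.

    struct node { int value; };

    int lookahead_demo(struct node *p, int hits)
    {
        int sum = p->value;   /* load: may miss in the cache and wait for memory    */
        hits = hits + 1;      /* independent: can execute while the load is pending */
        return sum + hits;    /* results are still committed in program order       */
    }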
Committing Results
• When do results get committed / written?
  – In order commit
  – Out of order commit – very dangerous!
• Advantages:
  – Hide load stalls
  – Hide memory latency
  – Approach 100% processor utilization
• Disadvantages:
  – LOTS of power-hungry hardware
  – Branch prediction is crucial
  – Complex control
Dynamic Scheduling Case Study
• Intel’s Pentium 4
– First appeared in 2000
• Possible for 126 instructions to be “in-flight” at one time!
• Processors have gone “backwards” on this since 2003
Thread-level Parallelism (TLP)
– If you have multiple threads…
• by having multiple programs running, or
• writing a multithreaded application
– …you can get higher performance by running these threads:
• On multiple processors, or
• On a machine that has multithreading support
– SMT (Simultaneous Multithreading, AKA “Hyperthreading”)
• Conceptually these are very similar
– The hardware is very different
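A minimal pthreads sketch of the single-program case (the names are illustrative): one program broken into two threads, each summing its own half of an array. The two threads could run on two processors, or share one SMT core.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread sums its own half of the data (thread-level parallelism). */
    struct work { const int *data; int len; long sum; };

    static void *partial_sum(void *arg)
    {
        struct work *w = arg;
        for (int i = 0; i < w->len; i++)
            w->sum += w->data[i];
        return NULL;
    }

    int main(void)
    {
        int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        struct work w[2] = { {data, 4, 0}, {data + 4, 4, 0} };
        pthread_t t[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, partial_sum, &w[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("sum = %ld\n", w[0].sum + w[1].sum);
        return 0;
    }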
Parallel Processing
• The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors
  – Discounts the idea of simply running separate tasks on separate processors – a common thing to do to get high throughput, but not really parallel processing
• Key questions in design:
  1. How do parallel processors share data and communicate?
     – shared memory vs distributed memory
  2. How are the processors connected?
     – single bus vs network
• The number of processors is determined by a combination of #1 and #2
The Jigsaw Puzzle Analogy
Serial Computing
Suppose you want to do a jigsaw puzzle
that has, say, a thousand pieces.
We can imagine that it’ll take you a
certain amount of time. Let’s say
that you can put the puzzle together in
an hour.
Shared Memory Parallelism
If Alice sits across the table from you,
then she can work on her half of the
puzzle and you can work on yours.
Once in a while, you’ll both reach into
the pile of pieces at the same time
(you’ll contend for the same resource),
which will cause a little bit of
slowdown. And from time to time
you’ll have to work together
(communicate) at the interface
between her half and yours. The
speedup will be nearly 2-to-1: working together, the two of you might take 35 minutes, compared with the ideal 30.
The More the Merrier?
Now let’s put Bob and Charlie on the
other two sides of the table. Each of
you can work on a part of the puzzle,
but there’ll be a lot more contention
for the shared resource (the pile of
puzzle pieces) and a lot more
communication at the interfaces. So
you will get noticeably less than a
4-to-1 speedup, but you’ll still have
an improvement, maybe something
like 3-to-1: the four of you can get it
done in 20 minutes instead of an hour.
Diminishing Returns
If we now put Dave and Ed and Frank
and George on the corners of the
table, there’s going to be a whole lot
of contention for the shared resource,
and a lot of communication at the
many interfaces. So the speedup
you’ll get will be much less than we’d
like; you’ll be lucky to get 5-to-1.
So we can see that adding more and
more workers onto a shared resource
is eventually going to have a
diminishing return.
Distributed Parallelism
Now let’s try something a little different. Let’s set up two
tables, and let’s put you at one of them and Alice at the other.
Let’s put half of the puzzle pieces on your table and the other
half of the pieces on Alice’s. Now you can both work
completely independently, without any contention for a shared
resource. BUT, the cost per communication is MUCH higher
(you have to scootch your tables together), and you need the
ability to split up (decompose) the puzzle pieces reasonably
evenly, which may be tricky to do for some puzzles.
More Distributed Processors
It’s a lot easier to add
more processors in
distributed parallelism.
But, you always have to
be aware of the need to
decompose the problem
and to communicate
among the processors.
Also, as you add more
processors, it may be
harder to load balance
the amount of work that
each processor gets.
Load Balancing
Load balancing means ensuring that everyone completes
their workload at roughly the same time.
For example, if the jigsaw puzzle is half grass and half sky,
then you can do the grass and Alice can do the sky, and then
you only have to communicate at the horizon – and the
amount of work that each of you does on your own is
roughly equal. So you’ll get pretty good speedup.
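A minimal sketch of the simplest decomposition strategy, block decomposition, assuming every item costs about the same to process (the function name is invented): each worker gets one contiguous chunk of the n items.

    /* Give worker `rank` (0 .. nworkers-1) a contiguous block of the n items,
       spreading any remainder over the first few workers. */
    void my_block(int n, int nworkers, int rank, int *first, int *count)
    {
        int base  = n / nworkers;
        int extra = n % nworkers;
        *count = base + (rank < extra ? 1 : 0);
        *first = rank * base + (rank < extra ? rank : extra);
    }

If the chunks cost very different amounts of work, an even split like this balances poorly and the problem has to be divided more cleverly.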
Load Balancing
Load balancing can be easy, if the problem splits up into
chunks of roughly equal size, with one chunk per
processor. Or load balancing can be very hard.
How is Data Shared?
• Distributed Memory Systems
– Each processor (or group of processors) has its own memory space
– All data sharing between “nodes” is done using a network of some kind
• e.g. Ethernet
– Information sharing is usually explicit
• Shared Memory Systems
– All processors share one memory address space and can access it
– Information sharing is often implicit
Shared Memory Systems
• Processors all operate independently, but out of the same memory.
• Some data structures can be read by any of the processors
• To properly maintain ordering in our programs, synchronization primitives (locks/semaphores) are needed! (a sketch follows the figure below)
[Figure: three processors, each with its own cache, connected by a single bus to shared memory and I/O.]
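A minimal sketch of why those primitives matter (the names are invented for illustration): two threads share one counter in the single address space, and a pthread mutex keeps their read-modify-write updates from being interleaved and lost.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                  /* shared data in the one address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *unused)
    {
        (void)unused;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);        /* without this, updates can be lost */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* reliably 200000 with the lock */
        return 0;
    }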
Multicore Processors
• A Multicore processor is simply a shared memory machine where the
processors reside on the same piece of silicon.
• “Moore’s law” continuing means we have more transistors to put on a chip
– Doing more superscalar or superpipelining will not help us anymore!
– Use these transistors to make every system a multiprocessor
Example Cache Coherence Problem
[Figure: processors P1, P2, and P3, each with a private cache ($), connected by a bus to shared memory and I/O devices. P1 and P3 each read u and cache the value 5; P3 then writes u = 7 into its own cache. When P1 and P2 subsequently read u, which value do they see?]
Cache Coherence
• According to Webster’s dictionary …
– Cache: a secure place of storage
– Coherent: logically consistent
• Cache Coherence: keep storage logically consistent
– Coherence requires enforcement of 2 properties:
  1. Write propagation – all writes eventually become visible to other processors
  2. Write serialization – all processors see writes to the same block in the same order
Cache Coherence Solutions
• Two most common variations:
– “Snoopy” schemes (a simplified sketch follows this list)
• rely on broadcast to observe all coherence traffic
• well suited for buses and small-scale systems
• example: Intel x86
– Directory schemes
• uses centralized or “regional” information to avoid broadcast
• scales fairly well to large numbers of processors
• example: SGI Origin/Altix
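To make the snoopy idea concrete, here is a heavily simplified sketch, in C, of the per-block state machine such a controller might implement for a basic MSI (Modified/Shared/Invalid) protocol. This is an illustration only; real protocols (MESI, MOESI, and the directory variants) track more states and many more corner cases.

    typedef enum { INVALID, SHARED, MODIFIED } msi_t;   /* MSI states for one cache block */

    /* Transition taken for the local processor's own load or store. */
    msi_t on_cpu_access(msi_t s, int is_write)
    {
        if (is_write) return MODIFIED;          /* write: gain exclusive ownership  */
        return (s == INVALID) ? SHARED : s;     /* read miss: fetch block as Shared */
    }

    /* Transition taken when another processor's request is snooped on the bus. */
    msi_t on_bus_snoop(msi_t s, int other_is_write)
    {
        if (other_is_write) return INVALID;     /* another writer: drop our copy          */
        if (s == MODIFIED)  return SHARED;      /* another reader: supply data, downgrade */
        return s;                               /* otherwise no change                    */
    }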
Why Cache Coherent Shared Memory?
• Pluses
  – For applications – looks like multitasking uniprocessor
  – For OS – only evolutionary extensions required
  – Easy to do communication without OS
  – Software can worry about correctness first and then performance
• Minuses
  – Proper synchronization is complex
  – Communication is implicit, so it may be harder to optimize
  – More work for hardware designers
• Result
  – Symmetric Multiprocessors (SMPs) are the most successful parallel machines ever
  – And the first with multi-billion-dollar markets!
Intel Core 2 Duo
• Multiple levels of cache coherence
  – On-chip, the L1 caches must stay coherent, using the L2 as the backing store
  – Off-chip, the processor must support standard SMP processor configurations
Distributed Memory Systems
• Hardware in which each processor (or group of processors) has its own private memory
• Communication is achieved through some kind of network
  – This can be as simple as Ethernet…
  – …or far more customized, if communication is important
• Examples
  – Cray XE6
  – A rack of Dell Poweredge 1950s
  – Folding@HOME
The Most Common Distributed Performance
System: Clustering
• A parallel computer built out of commodity
hardware components
– PCs or server racks
– Commodity network (like ethernet)
– Often running a free-software OS like Linux
with a low-level software library to facilitate
multiprocessing
• Use software to send messages between
machines
– Standard is to use MPI (message passing
interface)
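A minimal MPI sketch of that explicit message passing (the values and tag are arbitrary): rank 0 sends one integer to rank 1. It assumes an MPI implementation such as MPICH2 is installed, so it can be built with mpicc and launched with mpirun across the nodes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }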
What is a Cluster?
• “… [W]hat a ship is … It's not just a keel and hull and a deck and sails.
That's what a ship needs. But what a ship is ... is freedom.”
– Captain Jack Sparrow
“Pirates of the Caribbean”
What a Cluster is ….
• A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network
• It also needs software that allows the nodes to communicate
over the interconnect.
• But what a cluster is … is all of these components working
together as if they’re one big computer (sometimes called a
“supercomputer”)
What a Cluster is ….
• Nodes
  – PCs/Workstations
  – Server rack nodes
• Interconnection network
  – Ethernet (“GigE”)
  – Myrinet (“10GigE”)
  – Infiniband (low latency)
  – The Internet (typically called a “Grid”)
• Software
  – OS
    • Generally Linux (Redhat / CentOS / SuSE)
    • Windows HPC Server
  – Libraries (MPICH2, PBLAS, MKL, NAG)
  – Tools (Torque/Maui, Ganglia, GridEngine)
An Actual (Production) Cluster
[Photo: a production cluster, with the interconnect and the nodes labeled]
Other Actual Clusters…
What a Cluster is NOT…
• At the high end, many supercomputers are made with custom parts:
  – Custom backplane/network
  – Custom/Reconfigurable processors
  – Extreme custom cooling
  – Custom memory system
• Examples:
  – IBM Blue Gene
  – Cray XT/XE
  – SGI Altix (not even really a distributed memory machine…kind of)
Flynn’s Taxonomy of Computer Systems (1966)
A simple model for categorizing computers and computation: 4 categories:
1. SISD – Single Instruction Single Data
   – the standard uniprocessor model
2. SIMD – Single Instruction Multiple Data (DLP)
   – Full systems that are “true” SIMD are no longer in use
   – Many of the concepts exist in vector processing and SIMD extensions
3. MISD – Multiple Instruction Single Data
   – doesn’t really make sense
4. MIMD – Multiple Instruction Multiple Data (TLP)
   – the most common model in use
MIMD
• Multiple instructions are applied to multiple data
• The multiple instructions can come from the same program, or from different programs
  – Generally “parallel processing” implies the first
• Most modern multiprocessors are of this form
  – IBM Blue Gene, Cray T3D/T3E/XT3/4/5/6, SGI Origin/Altix
  – Beowulf and other “Homebrew” or third-party clusters
“True” SIMD
• A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time
• Most well known examples are:
  – Thinking Machines CM-1 and CM-2
  – MasPar MP-1 and MP-2
• All are out of existence now
• SIMD requires massive data parallelism
• Usually have LOTS of very very simple processors (e.g. 8-bit CPUs)
Vector Processors
• Closely related to SIMD
  – Cray J90, Cray T90, Cray SV1, NEC SX-6
  – Looked to be “merging” with MIMD systems
    • Cray X1E, as an example
  – Now appears to be dropped in favor of GPUs
• Use a single instruction to operate on an entire vector of data
  – Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline
  – Uses “vector registers” to feed a pipeline for the vector operation
  – Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data
    • (Because of this, didn’t typically have caches until late 90s)
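For a feel of what one vector instruction replaces, here is the classic DAXPY kernel in C (a standard textbook example, not tied to any particular machine): a vector processor would stream x and y through its vector registers and retire a whole chunk of this loop, on the order of 64 elements, per vector instruction, instead of issuing every scalar iteration separately.

    /* DAXPY: y = a*x + y, the canonical vector kernel. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }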
GPUs
Lots of “cores” on a single chip. The programming model is unusual.
• Probably closer to SIMD than Vector machines…
  – Actually executes things in parallel
• …but it’s not really SI
  – Threads each have a mind of their own
• I’ve seen them referred to as “SPMD” and “SPMT”
• More to the point, Flynn’s Taxonomy is not very good at describing specific systems
  – It’s better at describing a style of computation
  – Sadly, everyone likes to categorize things