
008. Architecture Classifications

Prof. (Dr.) Parul Goyal, Professor
High Performance Computing BCSE-526
B.Tech Eighth Semester
Computer Science & Engineering,
M. M. Engineering College,
Maharishi Markandeshwar (Deemed to Be University),
Mullana, Ambala - 133207
Flynn's taxonomy
• A way of describing the information flow in computers; used as an architectural classification
• Information is divided into instructions (I) and data (D)
• There can be single (S) or multiple (M) instances of both
• Four combinations: SISD, SIMD, MISD, MIMD
SISD
• Single Instruction, Single Data
• An absolutely serial execution model
• Typically viewed as describing a serial computer, but today's CPUs exploit internal parallelism
[Figure: a single processor P operating on a single memory M: one instruction stream, one data element at a time.]
SIMD
• Single Instruction, Multiple Data
• In this case one instruction is applied to multiple data streams at the same time (see the sketch below)
[Figure: a single instruction processor K broadcasts each instruction to an array of processing elements (PEs) P; each PE typically has its own data memory (Ma, Mb, Mc).]
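
A minimal C sketch of the SIMD idea (assuming GCC/Clang vector extensions; the type name v8f and the function are illustrative, not from the slides):

    // One vector add is a single instruction operating on eight
    // 32-bit floats at once (GCC/Clang vector extension).
    typedef float v8f __attribute__((vector_size(32)));

    void add_arrays(v8f *a, const v8f *b, int n_vecs) {
        for (int i = 0; i < n_vecs; i++)
            a[i] = a[i] + b[i];   // single instruction, multiple data
    }

Compiled for AVX-capable hardware, each iteration can map to one vector addition applied to eight data elements.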
MISD
• Multiple Instruction, Single Data
• Largely useless definition (not important)
• Closest relevant example would be a CPU that can 'pipeline' instructions
[Figure: processing elements P, each with its own instruction memory Mi and hence its own instruction stream, all operating on the same data stream from memory Ma.]
• Example: a systolic array, a network of small elements connected in a regular grid, operating under a global clock, each reading and writing elements from/to its neighbours
MIMD
• Multiple Instruction, Multiple Data
• Covers a host of modern architectures
[Figure: multiple processors P, each with its own memory M and a path to shared memory.]
• Processors have independent data and instruction streams
• Processors may communicate directly or via shared memory (see the threaded sketch below)
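
A hedged MIMD sketch in C using POSIX threads (the function and variable names are illustrative): each thread is an independent instruction stream operating on its own data, and the threads communicate through shared memory.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static double partial[NTHREADS];    // shared memory for communication

    static void *worker(void *arg) {
        int id = *(int *)arg;           // each stream has its own data...
        partial[id] = (id + 1) * 2.0;   // ...and its own computation
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        double sum = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);   // wait, then read shared results
            sum += partial[i];
        }
        printf("sum = %f\n", sum);
        return 0;
    }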
Instruction Set Architecture
• ISA – the interface between hardware and software
• ISAs are typically common to a CPU family, e.g. x86, MIPS (members are more alike than different)
• Assembly language is a realization of the ISA in a form that is easy to remember (and program)
Key Concept in ISA evolution and CPU design
• Efficiency gains are to be had by executing as many operations per clock cycle as possible
• Instruction level parallelism (ILP): exploit parallelism within the instruction stream
• The programmer does not see this parallelism explicitly
• Goal of modern CPU design: maximize the number of instructions per clock cycle (IPC), or equivalently reduce the cycles per instruction (CPI)
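
For reference, the standard relations behind these quantities (a textbook formula, with $N_i$ the executed instruction count and $t_c$ the clock cycle time; both symbols reappear in the RISC vs CISC recap later):

\[
\text{CPU time} = N_i \times \text{CPI} \times t_c,
\qquad
\text{IPC} = \frac{1}{\text{CPI}}
\]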
ILP versus thread level parallelism
• Many modern programs have more than one (parallel) "thread" of execution
• Instruction level parallelism breaks down a single thread of execution to try and find parallelism at the instruction level (see the sketch below)

[Figure: instructions 1, 2 and 3 of a single thread issued side by side; these instructions are executed in parallel even though there is only one thread.]
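
A minimal single-thread C sketch of ILP (illustrative names): the accumulators s0 and s1 form independent dependency chains, so the hardware can execute both additions in the same cycle even though there is only one thread.

    double sum_two_chains(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0;
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += x[i];        // chain 1
            s1 += x[i + 1];    // chain 2, independent of chain 1
        }
        if (n % 2) s0 += x[n - 1];   // leftover element for odd n
        return s0 + s1;
    }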
ILP techniques
• The two main ILP techniques are:
• Pipelining, including additional techniques such as out-of-order execution
• Superscalar execution
Pipelining
• Multiple instructions overlapped in execution
• A throughput optimization: it doesn't reduce the time taken by an individual instruction
[Figure: a seven-stage pipeline (Stage 1 ... Stage 7); successive instructions (Instr 1, Instr 2, Instr 3, ...) occupy successive stages, so several instructions are in flight at once.]
Design sweet spot
• Pipeline stepping time is determined by the slowest operation in the pipeline
• Best speed-up when all operations take the same amount of time
• Net time per instruction = (unpipelined instruction time) / (number of pipeline stages), i.e. one stepping time
• Perfect speed-up factor = number of pipeline stages
• Never achieved in practice: there are start-up overheads to consider
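
In symbols (a standard result, assuming perfectly balanced stages and ignoring fill/drain overhead): splitting an instruction of total time $T$ into $k$ pipeline stages gives

\[
t_{\text{step}} = \frac{T}{k},
\qquad
\text{net time per instruction} = t_{\text{step}},
\qquad
\text{ideal speed-up} = k .
\]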
Pipeline compromises
• Time to issue an instruction through a seven-stage pipeline
• Unbalanced stage times of 10, 10, 5, 10, 5, 10, 5 ns sum to 55 ns of useful work
• But every stage must step at the pace of the slowest stage (10 ns), so the shorter stages take longer than necessary: 7 × 10 ns = 70 ns per instruction
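
Working out the numbers above: the pipeline must be clocked at the slowest stage time, so

\[
\underbrace{10+10+5+10+5+10+5}_{\text{useful work}} = 55~\text{ns},
\qquad
\underbrace{7 \times 10}_{\text{clocked at the slowest stage}} = 70~\text{ns}.
\]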
Superscalar execution
• Careful about definitions: superscalar execution is not simply about having multiple instructions in flight
• Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or the load/store unit)
Benefits of superscalar design
• Having more than one functional unit of a given type can help schedule more instructions within the pipeline
• The Pentium 4 pipeline was 20 stages deep!
• Enormous throughput potential, but a big pipeline-stall penalty
• Incorporation of multiple units into the pipeline is sometimes called superpipelining
Other ways of increasing ILP
• Branch prediction: predict which path will be taken by assigning certain probabilities (see the sketch below)
• Out-of-order execution: independent operations can be rescheduled in the instruction stream
• Pipelined functional units: floating point units can be pipelined to increase throughput
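
A small illustrative C example of why branch prediction matters (the function name is ours): the branch below depends on the data, so on random input the predictor misses often, while on sorted input it predicts almost perfectly.

    long count_above(const int *x, int n, int threshold) {
        long count = 0;
        for (int i = 0; i < n; i++) {
            if (x[i] > threshold)   // the branch the CPU must predict
                count++;
        }
        return count;
    }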
Limits of ILP
• See D. Wall, "Limits of Instruction-Level Parallelism" (1991)
• The probability of hitting hazards (instructions that cannot be pipelined) increases as the pipeline gets longer
• Instruction fetch and decode rate
• Remember the "von Neumann" bottleneck? It would be nice to have a single instruction trigger multiple operations...
• Branch prediction: multiple conditional statements increase branches severely
• Cache locality and memory limitations: there are finite limits to the effectiveness of prefetch
Scalar Processor Architectures
• 'Scalar', pipelined: functional unit parallelism, e.g. the load/store and arithmetic units can be used in parallel (instructions in parallel)
• Superscalar: multiple functional units of the same kind, e.g. 4 floating point units can operate at the same time
• Modern processors exploit parallelism and can't really be called SISD
Complex Instruction Set Computing
• CISC – older design idea (the x86 instruction set is CISC)
• Many (powerful) instructions supported within the ISA
• Upside: makes assembly programming much easier (there was lots of assembly programming in the 1960s-70s)
• Upside: reduced instruction memory usage
• Downside: designing the CPU is much harder
Reduced Instruction Set Computing
• RISC – a newer concept than CISC (but still old)
• ARM and RISC-V(!) are RISC designs; Intel and AMD use RISC techniques internally
• Small instruction set: a CISC-type operation becomes a chain of RISC operations (see the sketch below)
• Upside: easier to design the CPU
• Upside: smaller instruction set => higher clock speed
• Downside: assembly language is typically longer (a compiler design issue, though)
• Most modern x86 processors are implemented using RISC techniques
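
An illustrative view of "a CISC operation becomes a chain of RISC operations" (the C function is ours, and the mnemonics in the comments are generic, not a real ISA): a single CISC-style memory-to-memory update such as mem[i] += v becomes load, operate on registers, store on a load/store machine.

    void add_to_mem(int *mem, int i, int v) {
        int r1 = mem[i];    // LOAD  r1, [mem + i]
        r1 = r1 + v;        // ADD   r1, r1, r2
        mem[i] = r1;        // STORE r1, [mem + i]
    }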
Birth of RISC
• Roots can be traced to three research projects:
• IBM 801 (late 1970s, J. Cocke)
• Berkeley RISC processor (~1980, D. Patterson)
• Stanford MIPS processor (~1981, J. Hennessy)
• The Stanford and Berkeley projects were driven by interest in building a simple chip that could be made in a university environment
• Commercialization benefitted from the three independent projects
• Berkeley project -> begat Sun Microsystems
• Stanford project -> begat MIPS (used by SGI)
RISC processors
• Complexity has nonetheless increased significantly
• Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations
• A digression: what if we could remove the scheduling complexity by using a smart compiler...?
RISC behemoth: ARM
• The most common chips in the world are now based on designs from Advanced RISC Machines (ARM)
• Started out 36 years ago building microcomputers in the UK
• Licenses its ISA out to other companies
• Apple, Nvidia, Samsung, AMD, Broadcom, Fujitsu, Amazon, Huawei and Qualcomm all use ARM technology
VLIW & EPIC
• VLIW – very long instruction word
• Idea: pack a number of mutually independent operations into one long instruction
• Strong emphasis on compilers to schedule instructions
• When executed, the long words are easily broken up and the operations dispatched to independent execution units

[Figure: three instructions (Instr 1, Instr 2, Instr 3) scheduled into one long instruction word.]
VLIW & EPIC II
• A natural successor to RISC, designed to avoid the need for complex scheduling in RISC designs
• VLIW processors should be faster and less expensive than RISC
• EPIC – explicitly parallel instruction computing, Intel's implementation (roughly) of VLIW
• The ISA is called IA-64
VLIW & EPIC III
• Hey – it’s 2021, why aren’t we all using Intel Itanium processors?
• AMD figured out an easy extension to make x86 support 64 bits & introduced multicore
• Backwards compatibility + "good enough performance" + poor Itanium compiler performance killed IA-64
RISC vs CISC recap
RISC (popular by mid 80s): operations on registers
• Pro: small instruction set makes design easy
• Pro: decreased CPI, but also a faster CPU through easier design (tc reduced)
• Con: complicated instructions must be built from simpler ones
• Con: efficient compiler technology absolutely essential

CISC (pre 1970s): operations directly on memory
• Pro: many powerful instructions, easy to write assembly language*
• Pro: reduced memory requirement for instructions, reduced number of total instructions (Ni)*
• Con: ISA often large and wasteful (20-25% usage)
• Con: ISA hard to debug during development
Who “won”? – Not VLIW!
• Modern x86 chips are RISC-CISC hybrids
• The ISA is translated at the hardware level into shorter instructions (micro-ops)
• Very complicated designs though, with lots of scheduling hardware
• MIPS, Sun SPARC and DEC Alpha were much truer implementations of the RISC ideal
• Modern metric for determining the RISCkyness of a design: does the ISA restrict memory access to LOAD/STORE instructions?
Evolution of Instruction Sets
• Single accumulator (EDSAC, 1950)
• Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953)
• Separation of programming model from implementation:
• High-level language based (B5000, 1963)
• Concept of a family (IBM 360, 1964)
• General purpose register machines:
• Complex instruction sets (VAX, Intel 432, 1977-80)
• Load/store architecture (CDC 6600, Cray 1, 1963-76)
• RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, ... 1987)
• LIW/"EPIC"? (IA-64 ... 1999)
Simultaneous multithreading
• A completely different technology to ILP
• NOT multi-core
• Designed to overcome a lack of fine-grained parallelism in code
• The idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
• Requires the programmer to have created a parallel program for this to work, though
• One physical processor looks like two logical processors
Motivation for SMT
• Strong motivation for SMT: memory latency is making load operations take longer and longer
• Need some way to hide this bottleneck (the memory wall again!)
• SMT: switch execution over to threads that already have their data, and execute those
• The TERA MTA (bought by Cray) was an attempt to design a computer entirely around this concept
SMT Example: IBM Power 9
• 12 to 24 cores; each core can support up to 8 threads
• SMT gives a ~40-50% improvement in performance for 1-2 threads; not bad
• Intel Hyper-Threading gives a ~20-30% improvement
• 8 threads gets to a 100% performance increase
Multiple cores
• Simply add more CPUs
• Easiest way to increase throughput now
• Why do this?
• A response to the problem of increasing power dissipation on modern CPUs
• We've essentially reached the limit on improving individual core speeds
• Design involves compromise: n CPUs must now share the memory bus, leaving less bandwidth for each
Intel & AMD multi-core processors
• Intel: 56-core "Xeon Platinum" processors
• Design envelope 400 W, but divide by the number of cores: each core is very power efficient
• $20k each(!)
• AMD: 64-core "Ryzen Threadripper" processors
• 280 W design envelope
• Individual cores not as fast as Intel's, though (about 20% less speed)
• $4k
RISC-V (2010)
• A new approach to CPU design
• “Linux of processor design”
• ISA design available via open source licenses
• No fees to use it
• Design tools readily available
• Dozens of CPU designs now created based on RISC-V
• Further opens up the possibility of domain specific hardware
Summary
• Flynn's taxonomy categorizes instruction and data flow in computers
• Modern processors are MIMD
• Pipelining and superscalar design improve CPU performance by increasing the instructions per clock
• CISC/RISC design approaches appear to be reaching the limits of their applicability
• In the absence of improved single-core performance, designers are simply integrating more cores