Topic 5 Processor Development

Topic 5
Processor Development
AH Computing
Computer Architecture
SQA arrangements

Description of the evolution of the following microprocessor architectures: the
Power PC series, the Intel X86 series and the Intel IA-64 in terms, where
appropriate, of the following features and techniques:
increasing clock speeds
data bus widths
pipelining
superscalar processing
branch prediction
speculative loading of data and executing of instructions
predication
the number and function of registers used
SIMD
RISC
CISC
Explanation of the relationship between these developments and system
performance.
Introduction



From the 1980s, microprocessor architecture has
developed rapidly, as a result of:
Increasing miniaturisation of microelectronic
circuitry, which means that more and more
complex chip designs have become possible
and economically viable
The pressure from software developers to
design microprocessors with ever increasing
performance
Introduction

The first microprocessors were not general
purpose processors but were designed for
specific applications
Intel 4004 (1971)







the first complete CPU on one chip
the first commercially available microprocessor used
in calculators, data terminals, numeric control
systems etc.
1KByte of data memory and 4KBytes of instruction
memory
16 4-bit general purpose (GP) registers
Clock speed of 740 kHz
45 instructions
Intel 8080 (1974)





16-bit address bus, 8-bit
data bus
PC was 16 bits long
7 8-bit GP registers
Used in the first personal
computer, the Altair 8800
Others include the Zilog Z-80 and the
MOS Technology 6502
Processor Development
Look at the evolution of families of processors
• Power PC
• Intel X86
• Intel IA-64
Processor Development
Compare the following features and techniques
• Increasing clock speeds
• Data bus widths
• Pipelining
• Superscalar processing
• Branch prediction
• Speculative loading of data
• Predication
• The number and function of registers used
• SIMD
• RISC
• CISC
8086/88 (1979)
Pentium




Intel introduced superscalar architecture with the
Pentium processor
2 integer arithmetic and logic units
1 floating point unit
8 80-bit floating point registers
X86 series evolution
Development of registers X86
8086
80286
Development of registers X86
80486
Development of registers X86
Pentium 3
Summary of X86
The X86 series of microprocessors can be characterised
as having:
• a relatively small number of registers (8 GP, 8 FP
and 8 SIMD)
• a large instruction set
• instructions of varying length
• many addressing modes
These characteristics are typical of CISC (complex
instruction set computer) architecture. Other CISC
based processors include the IBM 370 and the
VAX-11/780.
Questions (Scholar page 128)
5. Sketch a graph of the increase in clock speeds from
the 8086 to the Pentium processor.
6. Which of the X86 processors was the first to use
pipelining to improve performance?
7. How many registers has the (a) 8086, (b) 80286, (c)
80486, (d) Pentium?
8. Which X86 chip was the first to have a superscalar
architecture?
9. The X86 series are considered to be CISC
processors. Justify this claim.
PowerPC series
Background



Improvements in processor capability and
operating systems led to the birth of the
Wintel PC
Wintel is a portmanteau of Windows and Intel.
It usually means a computer based on an Intel
x86 compatible processor and running the
Microsoft Windows operating system.
Still dominates the laptop and desktop market
Motorola



At the same time, Motorola was developing its
own family of microprocessors, the 68000
series
These were developed as 32-bit processors
from the start
As a result, Apple was able to develop its
Macintosh computers with a true graphical OS
from the start
Motorola 68000 (1979)







Same time as Intel 8086
8MHz clock speed
32-bit architecture
16-bit data bus, 24-bit address bus
16 32-bit registers (8 data, 8 address)
No segment registers required as direct
addressing used
Used pre-fetching to speed up execution
Motorola 68020 (1984)



32-bit data and address buses
Pipeline had 3 stages
256-byte cache added
Motorola 68040 (1991)




32-bit data and address buses
Pipeline had 6 stages
Floating point unit added
4Kbyte caches for data and programs added
Motorola 68060 (1994)



Superscalar – 3 execution units, 2 integer and
1 FP
10 stage pipelines
8Kbyte caches for data and programs
Motorola series




Used in Sun workstations, Apple Macintosh
computers, and later Atari computers
No longer in use in main computer market
Still used in embedded systems
Main Characteristics of Motorola series
In the final years of the 68000 processors, Apple,
Motorola and IBM defined a specification for open
system software and hardware, and Motorola and IBM
designed the first PowerPC chip to meet this
specification.
PowerPC


Acronym for “performance optimised with
enhanced RISC”
Compared with CISC-based X86



More registers
A smaller, but more efficient, instruction set
Fewer addressing modes
PowerPC





First chip, the 601, released in 1993
32-bit chip with a 64-bit data bus
Clock speed of 60MHz
Up to 4 GB of memory
Superscalar architecture: 3 independent
execution units (integer, floating point and
branch processing), each with a 6 stage
pipeline
Used in the
Xbox 360
Used in the
Nintendo Wii
Power PC overview

Used in





Controllers in cars
Networking – routers and servers
Honda’s Asimo
Vehicle-Management Computer for the F-35
fighter jet
PlayStation 3, Wii, Nintendo GameCube
All Power PC processors have
• two sets of 32 programmer-accessible GP registers (64 bits wide)
• a small number of special purpose registers
Comparison of X86 with PowerPC
Direct addressing for Load,
Store and Branch instructions.
All other instructions address
internal registers
TRENDS important
Summary of table





clock speeds have increased by a factor of 50
in 10 years
bus speeds have increased by a factor of 20
the complexity (no. of transistors) has
increased by a factor of 20
on chip cache has increased
new features have been added.
Clock speeds



PowerPC chips had clock speeds lower than
CISC based designs
But more efficient RISC based technology
gave a better performance.
Clock speed alone cannot be used to compare
processors
Questions (Page 133)
10. Which 3 companies cooperated in the design of the PowerPC specification?
11. What was the first PowerPC chip released, and when?
12. The 601 chip can be described as superscalar. How is this justified?
13. How many programmer accessible registers are there in all PowerPC chips?
14. Compare the X86 and PowerPC architectures in terms of
a) instruction set
b) instruction length
c) addressing modes
15. What new feature did the G3 chip have which improved performance?
16. Which was the first PowerPC chip to have SIMD instructions?
a) 601
b) 604e
c) G3
d) G4
e) G5
17. Why is clock speed not a good way of comparing a Windows PC with an Apple
Macintosh?
18. Other than in Apple computers, what are PowerPC chips used for?
Answers
Q10: Apple, Motorola, IBM
Q11: the 601 in 1993
Q12: it has 3 independent processing units - the floating point unit (FPU), the integer
ALU, and the system unit
Q13: 2 sets of 32 registers, each 64 bits wide
Q14: a) similar - X86 has 235 different instructions, PowerPC has 225
b) X86 has varied instruction lengths (1-11 bytes), the PowerPC instructions are all
exactly 4 bytes
c) the X86 has 11 addressing modes, the PowerPC has only 2
Q15: L2 "backside" cache on chip
Q16: d) G4
Q17: because the Mac uses the more efficient RISC architecture, a Mac with a lower
clock speed may outperform a Windows PC with a higher clock speed
Q18: IBM servers, Nintendo Game Cube, and a range of embedded applications
Intel IA-64
Intel IA-64


The X86 series reached its peak with the
Pentium 3, Pentium 4 and Athlon processors.
These are essentially CISC processors, using
pipelining and superscalar processing, but
with some RISC-like features. In 1994, Intel
and HP began work on designing a new 64-bit
architecture to replace the X86 series.
EPIC





The IA-64 combines RISC and CISC features, and is given
the description EPIC - explicitly parallel instruction
computing. There are 4 key features to the design:
instruction level parallelism - the compiler creates code
which uses the many parallel execution units of the
processor
use of VLIW - very long instruction words
use of predication - executing both branches of a
program, then discarding the "not chosen" branch results
use of speculative loading - use of large fast cache to
load data and instructions in advance of when they will
be required
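The idea of predication described above can be sketched in Python. This is only an illustration of the principle: the function names are made up for this example, and Python does not actually run the two sides in parallel as the IA-64 hardware can. The point is that the predicated version computes both results and then keeps only the one selected by the condition.

```python
def abs_diff_branching(a: int, b: int) -> int:
    """Conventional version: only one side of the branch is executed."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a: int, b: int) -> int:
    """Predicated version: both sides are computed, one result is discarded."""
    predicate = a > b          # the condition is evaluated once
    result_if_true = a - b     # "taken" side, always computed
    result_if_false = b - a    # "not taken" side, always computed
    # Keep the result whose predicate holds; the other is thrown away.
    return result_if_true if predicate else result_if_false
```

Both functions give the same answer. The predicated form has no jump in its main path, so there is nothing for a pipeline to guess wrongly.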
[Figures: X86 compared with IA-64]
VLIW





Very Long Instruction Words
Fetched from memory in bundles of 128 bits
Each bundle contains 3 instructions
Each instruction is 41 bits long
The final 5 bits are a pointer, which indicates to the processor
which of the many execution units each instruction should be
assigned to.
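The bundle layout above can be checked with a little arithmetic: 3 x 41 + 5 = 128 bits. The sketch below packs and unpacks such a bundle using Python integers. The function names and the exact bit ordering (three instructions first, pointer in the last 5 bits) are assumptions made for illustration, following the wording of the slide rather than any Intel documentation.

```python
INSTRUCTION_BITS = 41
POINTER_BITS = 5
BUNDLE_BITS = 3 * INSTRUCTION_BITS + POINTER_BITS  # 123 + 5 = 128

INSTRUCTION_MASK = (1 << INSTRUCTION_BITS) - 1
POINTER_MASK = (1 << POINTER_BITS) - 1


def pack_bundle(instr1: int, instr2: int, instr3: int, pointer: int) -> int:
    """Pack three 41-bit instructions followed by a 5-bit pointer."""
    for instr in (instr1, instr2, instr3):
        if not 0 <= instr <= INSTRUCTION_MASK:
            raise ValueError("instruction does not fit in 41 bits")
    if not 0 <= pointer <= POINTER_MASK:
        raise ValueError("pointer does not fit in 5 bits")
    bundle = instr1
    bundle = (bundle << INSTRUCTION_BITS) | instr2
    bundle = (bundle << INSTRUCTION_BITS) | instr3
    bundle = (bundle << POINTER_BITS) | pointer
    return bundle


def unpack_bundle(bundle: int) -> tuple:
    """Split a 128-bit bundle back into (instr1, instr2, instr3, pointer)."""
    pointer = bundle & POINTER_MASK
    bundle >>= POINTER_BITS
    instr3 = bundle & INSTRUCTION_MASK
    bundle >>= INSTRUCTION_BITS
    instr2 = bundle & INSTRUCTION_MASK
    bundle >>= INSTRUCTION_BITS
    instr1 = bundle & INSTRUCTION_MASK
    return instr1, instr2, instr3, pointer
```

One memory fetch of a bundle therefore delivers three instructions at once, which is where the saving in memory fetches comes from.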
IA-64 Execution Units




I-unit (integer and logical operations)
M-unit (load and store operations)
B-unit (branch instructions)
F-unit (floating point operations)
Pointer




5 bits = 32 different combinations
00000 – send instruction 1 to the M-unit,
instruction 2 to the I-unit, instruction 3 to another I-unit
11101 – send instruction 1 to the M-unit, instruction
2 to the F-unit and instruction 3 to the B-unit
The pointer is created by the compiler which
determines in advance whether or not instructions
can be executed in parallel
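The way the pointer steers instructions can be sketched as a simple lookup table. Only the two pointer values quoted above are included; the other 30 combinations are deliberately left out rather than guessed. The function name and the instruction names in the usage example are hypothetical.

```python
# Pointer value -> execution unit for instructions 1, 2 and 3.
# Only the two examples given in the slide are listed.
POINTER_TO_UNITS = {
    0b00000: ("M", "I", "I"),
    0b11101: ("M", "F", "B"),
}


def dispatch(pointer: int, instructions: list) -> list:
    """Pair each of the 3 instructions in a bundle with its execution unit."""
    if len(instructions) != 3:
        raise ValueError("a bundle holds exactly 3 instructions")
    units = POINTER_TO_UNITS.get(pointer)
    if units is None:
        raise KeyError("pointer value not covered by this sketch")
    return list(zip(units, instructions))
```

For example, `dispatch(0b11101, ["load", "fadd", "branch"])` pairs the first instruction with the M-unit, the second with the F-unit and the third with the B-unit. Because the compiler chose the pointer, the processor does no scheduling work of its own at run time.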
The Compiler
When the instruction bundle arrives at the
processor, the 3 instructions are directed to the
appropriate execution unit for processing:
Summary of IA-64

Performance is enhanced by
• the use of VLIW, which reduces the number of
relatively slow memory fetches
• the sequencing of instructions being
determined by the compiler (rather than
being dealt with at run time)
The Itanium processor

The first commercial version of the IA-64
architecture was massively superscalar, with
11 execution units:
• 4 integer units
• 2 floating point units
• 3 branch units
• 2 load/store units
The Itanium processor

It makes extensive use of




Predication
Speculative loading of both data and instructions
Executes up to 20 operations per cycle
Clock speed of 800MHz is the equivalent of
an X86 or PowerPC running at several GHz
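A rough worked example shows why clock speed alone is misleading. It uses the figures from the slides (800MHz, up to 20 operations per cycle) and treats them as peak values; the function names are invented for this sketch, and real performance depends on the program being run.

```python
def operations_per_second(clock_hz: int, operations_per_cycle: int) -> int:
    """Peak operations per second = clock rate x operations per cycle."""
    return clock_hz * operations_per_cycle


def operations_per_cycle_needed(target_ops_per_second: int, clock_hz: int) -> float:
    """Operations per cycle another processor needs to match a target rate."""
    return target_ops_per_second / clock_hz


ITANIUM_CLOCK_HZ = 800_000_000      # 800MHz
ITANIUM_OPS_PER_CYCLE = 20          # peak figure quoted in the slides

itanium_peak = operations_per_second(ITANIUM_CLOCK_HZ, ITANIUM_OPS_PER_CYCLE)

# A 2.5GHz processor would need this many operations per cycle to keep up.
needed_at_2_5_ghz = operations_per_cycle_needed(itanium_peak, 2_500_000_000)
```

Here `itanium_peak` is 16 thousand million operations per second, so a 2.5GHz processor would have to complete more than 6 operations in every cycle to match it.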
The Itanium processor
It has
• 128 64-bit registers for integer/logical/general purpose use
• 128 82-bit registers for floating point and graphics use
• Data bus 128 bits wide
• Address bus 64 bits wide (potentially 16 Exabytes of addressable memory)
1 Exabyte = 1024 Petabytes = 1024 x 1024 Terabytes = 1024 x 1024 x 1024 Gigabytes
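The size of the address space follows directly from the width of the address bus: n address lines give 2^n addresses. The short sketch below works this out, assuming one byte per address and the 1024-based units used in the slide; the function names are made up for the example.

```python
GIGABYTE = 1024 ** 3
TERABYTE = 1024 * GIGABYTE
PETABYTE = 1024 * TERABYTE
EXABYTE = 1024 * PETABYTE   # = 1024 x 1024 x 1024 Gigabytes


def addressable_bytes(address_bus_bits: int) -> int:
    """Number of bytes reachable with the given number of address lines."""
    return 2 ** address_bus_bits


def addressable_exabytes(address_bus_bits: int) -> int:
    """The same quantity expressed in whole Exabytes."""
    return addressable_bytes(address_bus_bits) // EXABYTE
```

With a 64-bit address bus, `addressable_exabytes(64)` gives 16. By the same rule a 32-bit address bus reaches 4 Gigabytes.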
Questions (Page 137)
19. The IA-64 uses VLIW. What does this mean?
20. Can the Itanium be described as a superscalar
architecture?
21. IA-64 chips use predication. Explain the
difference between predication and branch
prediction.
22. How can an 800MHz Itanium outperform a
2.5GHz Pentium?
Answers
Q19: VLIW = very long instruction word; the IA-64 fetches a 128
bit bundle containing three 41-bit instructions during each memory
fetch
Q20: yes, it has 11 execution units which can operate in parallel
Q21: branch prediction means "guessing" whether or not a branch
will be taken, and executing following instructions accordingly - if
the prediction is wrong, the pipeline will stall; predication means
executing instructions from both branches simultaneously, and
discarding the results from the branch which is not required
Q22: due to its parallel execution units, 10 stage pipeline, VLIW
memory accessing and use of predication and speculative loading,
the Itanium can process up to 20 operations per cycle.
Intel Itanium



Intel has released two processor families
using the brand: the original Itanium and the
Itanium 2.
Starting November 1, 2007, new members of
the second family are again called Itanium.
The processors are marketed for use in
enterprise servers and high-performance
computing systems.
Dual Core

Dual-core refers to a
CPU that includes
two complete
execution cores per
physical processor.
Parallel Computing
SQA arrangements

Description of how parallel computers
function referring to their use of:




local (cache) as well as main memory
pipelining
local pathways and packet switching to achieve
communication between CPUs.
Description of the performance benefits of
parallel computers.
Examples of parallel computing



Pipelining - executing one instruction while
fetching the next
Superscalar architecture– multiple execution
units all processing different operations
simultaneously
SIMD instructions – the same instruction
being applied to several data items at the same
time
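The SIMD idea can be modelled in a few lines of Python. This is a simulation only: a real SIMD instruction processes all the data items in one step in hardware, whereas the loop here just shows the effect. The function names, the limit of 255 and the choice of 4 data items per instruction are assumptions for the example.

```python
import math


def packed_add_saturating(values: list, amount: int, maximum: int = 255) -> list:
    """Model one SIMD add: the same operation applied to every data item."""
    return [min(value + amount, maximum) for value in values]


def simd_instruction_count(item_count: int, items_per_instruction: int) -> int:
    """Instructions needed when each one handles several data items."""
    return math.ceil(item_count / items_per_instruction)
```

For example, `packed_add_saturating([10, 100, 200, 250], 20)` adds 20 to four values at once and caps the last one at 255. Processing 16 values takes 4 such instructions instead of 16 ordinary ones.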
Parallel Computing


Another approach is to have multiple
processors
This is the basis of most mainframe
computers and supercomputers
Parallel Computing



Using multiple processing elements simultaneously
to solve a problem.
accomplished by breaking the problem into
independent parts so that each processing element
can execute its part of the algorithm simultaneously
with the others.
The processing elements can include resources such
as a single computer with multiple processors,
several networked computers, specialized hardware,
or any combination of the above
Multiprocessing, mainframes and
supercomputers

Simplest – several processors connected to
the same system bus…
Multiprocessing, mainframes and
supercomputers



Each processor has shared
access to memory and to I/O
devices
Master-slave – in some systems,
one processor controls
the others
Symmetrical Multiprocessing
(SMP) – in other systems, all processors are
equal (up to 10 processors)
Multiprocessing


Not limited to mainframe systems
PowerMac G5 dual processor desktop system
has 2 G5 processors
Comparison
Massively Parallel Architectures
Massively parallel processing (MPP) is a term used in computer architecture to
refer to a computer system with many independent
arithmetic units or entire microprocessors that run in
parallel.
The term massive connotes hundreds if not
thousands of such units.
Processors are arranged in an interconnected
array which serves as a network.
Early examples of such a system are the Distributed
Array Processor, the Goodyear MPP, the Connection
Machine, and the Ultracomputer.
Massively Parallel Architectures

Today's most powerful supercomputers are all
MP systems such as






Earth Simulator,
Blue Gene,
ASCI White,
ASCI Red,
ASCI Purple, and
ASCI Thor's Hammer.
Massively Parallel Architectures
Memory
• Each processor has access to its own local
memory or cache
• All processors can access a main (global)
memory by a system-wide bus
Massively Parallel Architectures


Processors are pipelined - the results from
one processor can become the input for
another processor
As well as the system bus, there may be
local pathways connecting groups of
processors into clusters, and other
pathways connecting clusters
MP Architectures - communication
To achieve communication between processors, parallel
computers use:
• data pathways (buses) to connect clusters of
processors, as well as system buses to connect
processors and pipelines, enabling the results of one
CPU to flow into another
Or
• packet switching techniques similar to those used in
networks to manage the flow of data between
processors.
MP Architectures - communication


Packet switching techniques, similar to
those on a network, are used in which
data packets are assigned the addresses of
specific nodes (processors) on the array.
This enables any processor on the array
to access the local memory of any other
processor on the array or to pass data or
instructions to other processors.
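Packet switching between processors can be pictured with a very small model. Everything here is hypothetical and simplified: the class names, the single central switch and the "inbox" on each node are assumptions made for illustration, not a description of any real machine. The key point is that each packet carries the address of the node it is meant for.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Packet:
    destination: int   # address of the node (processor) to deliver to
    payload: str       # the data or instruction being passed


@dataclass
class Node:
    address: int
    inbox: list = field(default_factory=list)


class Switch:
    """Delivers queued packets to the node named in each packet."""

    def __init__(self, node_count: int):
        self.nodes = {address: Node(address) for address in range(node_count)}
        self.queue = deque()

    def send(self, packet: Packet) -> None:
        if packet.destination not in self.nodes:
            raise ValueError("no node with that address")
        self.queue.append(packet)

    def deliver_all(self) -> None:
        while self.queue:
            packet = self.queue.popleft()
            self.nodes[packet.destination].inbox.append(packet.payload)
```

Any node can send to any other simply by putting the right address on the packet; the sender does not need a dedicated pathway to the receiver.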
MPP
Examples - Lucidor




It consists of 90 interconnected nodes. Each node
has two 900MHz Itanium 2 processors accessing 16K
of L1 cache, and 256K of L2 cache.
Each node can access the system bus via a 128 port
switch at a data transfer rate of 2Gbits per second.
In addition to the local memory, each node has
shared access to 6GB of main memory.
As a result, the system can achieve data processing
rates of over 600GFlops.
Hitachi SR2201




The SR2201 can have from 8 up to 2048 processors. The processors (Hitachi
RISC chips) are arranged in a 3-dimensional grid to
maximise communication between them.
As with Lucidor, speeds of up to 600GFlops
can be achieved.
These systems are in use for a variety of applications,
including structural and crash analysis, fluid dynamics
research, quantum chemistry analysis and visualisation
tools.
All of these can make use of the parallel architecture, as
they require high speed processing of large amounts of
data.
Cray



The Cray T3D is a current example, with 2048
nodes arranged in a 3-dimensional grid.
Each node has 2 Alpha processors, with
access to individual cache and 8Mwords of
memory.
Cray claims that this system can process 1
trillion floating point operations per second.
Cray 2
Blue Gene




Blue Gene is a computer architecture project
to produce several supercomputers
designed to reach operating speeds in the PFLOPS
(petaFLOPS) range, and currently reaching sustained
speeds of nearly 500 TFLOPS (teraFLOPS).
Blue Gene/L has 65,536 processors.
Each is connected by 3 networks.
At the time of writing, Blue Gene/L is the fastest
computer in the world, achieving over 70 TFLOPS.
Blue Gene
Chip – 2 processors
Card – 2 chips
Node – 16 cards
Cabinet – 32 nodes
System – 64 cabinets
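Multiplying out the levels listed above gives the size of the whole system. The sketch below does the arithmetic using the figures on this slide; the constant and function names are invented for the example.

```python
PROCESSORS_PER_CHIP = 2
CHIPS_PER_CARD = 2
CARDS_PER_NODE = 16
NODES_PER_CABINET = 32
CABINETS_PER_SYSTEM = 64


def total_chips() -> int:
    """Number of chips in the full system."""
    return CHIPS_PER_CARD * CARDS_PER_NODE * NODES_PER_CABINET * CABINETS_PER_SYSTEM


def total_processors() -> int:
    """Number of individual processors in the full system."""
    return PROCESSORS_PER_CHIP * total_chips()
```

This gives 65,536 chips and 131,072 processors. The figure of 65,536 quoted for Blue Gene/L matches the chip count here; the slides do not say whether that figure counts chips or individual processors.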
Blue Gene
Exercise

Research one of the following –







Earth Simulator,
Blue Gene,
ASCI White,
ASCI Red,
ASCI Purple, and
ASCI Thor's Hammer.
In terms of





Number of nodes
Number of processors at each node
Global memory
Processing power in teraflops
Applications
Past Paper Questions




2011 Q 14
2008 Q 13
2007 Q17a,b
2006 Q15b
Past Paper 2008
JGT(37) If the flag is set jump to location 37
Describe the problem that instruction JGT(37)
could cause for a processor using a pipeline.
2009
Mediatrain is a company which uses a high
performance computer system to produce multimedia
training projects. The computer system has a PowerPC
superscalar processor which has thirty two 64-bit
general purpose registers.
(a) The PowerPC is an example of a RISC processor.
RISC processors have a large number of general
purpose registers. Name three other features of a RISC
processor that distinguish it from a CISC
processor. (3)
2009 cont
c. Explain the benefit to the PowerPC processor
of having so many general purpose registers
(2)
d. Most of the instructions in the PowerPC
processor instruction set have an op-code and
an operand.
Describe the function of the op-code and the
operand. (2)
2009 cont
e. Superscalar processing involves the use of
multiple pipelines.
State a feature of the PowerPC processor
which makes it suited to superscalar
processing. Justify your answer. (4)
2009 cont
f. Branch instructions can cause a problem for
processors which use pipelines.
Branch prediction can reduce this problem.
Describe how branch prediction operates. (3)
2009
(g) The PowerPC processor makes use of Single
Instruction Multiple Data (SIMD)
instructions.
Explain how the use of SIMD instructions
improves performance, using a suitable
multimedia example. (3)
2008
15. The Pentium III processor has eight registers
which can be operated on by SIMD
instructions.
(a) Describe what is meant by a SIMD
instruction. (1)
(b) Describe how the Pentium III could use
SIMD instructions and registers when
adjusting the brightness of a graphic. (3)