PowerPC

PowerPC Architecture
Nirmal Chhugani
Introduction
o PowerPC (Performance Optimization With Enhanced RISC –
Performance Computing) is a RISC architecture created by
the Apple–IBM–Motorola (AIM) alliance in 1991.
o The original idea for the PowerPC architecture came
from IBM's POWER architecture (introduced in the RISC System/6000), and
PowerPC retains a high level of compatibility with it.
o The intention was to build a high-performance, low-cost
superscalar processor.
History
o The history of the PowerPC began with IBM's 801 prototype chip,
which implemented the RISC ideas of John Cocke (IBM Watson Research Lab)
in the late 1970s (with further refinements developed by David Patterson).
o 801-based cores were used in a number of IBM embedded
products, eventually becoming the 16-register ROMP (Research
Office Products Division Micro Processor), a 10 MHz
RISC microprocessor designed by IBM in the early 1980s and
used in the IBM RT, a computer workstation.
o The RT had disappointing performance, and IBM started a
project to build the fastest processor on the market. The result
was the POWER architecture, introduced with the RISC
System/6000 in early 1990.
History: POWER architecture
The POWER architecture incorporated many RISC
characteristics:
• fixed-length instructions
• register-to-register architecture
• simple addressing modes
• large general register file
• three-operand instruction format
Additionally, it has other features more characteristic of more complex ISAs.
Power Architecture
o Designed to be superscalar: instructions are dispatched across three independent
units (branch, fixed-point, and floating-point), which allows out-of-order execution.
o Compound instructions: a load or store can update the base register with
the newly calculated effective address, eliminating the extra add
instructions otherwise needed to increment the index for array traversals.
o No delayed branches: instead, the POWER architecture
uses a branch target buffer and the now well-known branch folding technique.
o Branching technique: the POWER architecture has a condition register with eight
fields that are set by compare instructions. One additional bit in the encoding of
many arithmetic instructions (the record bit, Rc) signals that the instruction should
also set a condition field, so a separate compare is often unnecessary.
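The compound (update-form) load described in this list can be sketched as a small Python emulation. The `Machine` class, its register list, and the dictionary used as memory are hypothetical names made up for illustration; the sketch only assumes the behavior stated above, namely that the load writes the newly computed effective address back into the base register.

```python
class Machine:
    """Toy model: 32 registers plus a dict mapping byte addresses to 32-bit words."""

    def __init__(self, memory):
        self.gpr = [0] * 32          # r0..r31
        self.memory = dict(memory)   # address -> word (illustrative memory model)

    def load_word_update(self, rt, ra, disp):
        """Update-form load: rt <- MEM[ra + disp], then ra <- ra + disp."""
        ea = (self.gpr[ra] + disp) & 0xFFFFFFFF   # effective address
        self.gpr[rt] = self.memory[ea]
        self.gpr[ra] = ea                          # base register now holds the EA


def sum_array(base, count, memory):
    """Sum `count` words starting at `base` with no separate pointer increment."""
    m = Machine(memory)
    m.gpr[4] = base - 4        # start one word early so the first update lands on `base`
    total = 0
    for _ in range(count):
        m.load_word_update(rt=5, ra=4, disp=4)
        total += m.gpr[5]
    return total, m.gpr[4]
```

Each loop iteration uses a single load to fetch the element and advance the pointer, which is the saving the slide attributes to compound instructions.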
Shortfalls
o The original POWER microprocessor, one of the first
superscalar RISC implementations, was a high performance,
multi-chip design.
o IBM soon realized that they would need a single-chip
microprocessor to scale their RS/6000 line from lower-end
to high-end machines.
o Work began on a single-chip POWER microprocessor called the
RSC (RISC Single Chip). In early 1991, IBM realized
that the design could potentially become a high-volume
microprocessor used across the industry.
PowerPC Architecture
o To maintain RS/6000 software compatibility, PowerPC adopted
the POWER architecture, with many changes made to provide a
low-cost, single-chip, superscalar, multiprocessor-capable, 64-bit processor.
• Several bit/field instructions that use three source operands were eliminated to
avoid the need for extra register ports.
• Complex string instructions were left out, consistent with the RISC philosophy.
• Instructions whose operation depended on the value of a source operand
were eliminated.
• Precision shifts, integer multiplies, and divide-with-remainder instructions were
omitted.
• Support was added for operation in both big-endian and little-endian modes.
• Single- and double-precision floating-point arithmetic is supported.
• The architecture is 64-bit, backward compatible with 32-bit.
PowerPC family
o PowerPC 601:
• medium-sized, medium-performance processor
• includes a more sophisticated branch unit
• can dispatch up to three instructions per cycle, out of order
• up to eight instructions per cycle can be fetched directly into an eight-entry
instruction queue (IQ), where they are decoded before being
dispatched to the execution core
• Branch folding: the instruction queue is used for detecting and dealing
with branches. The branch unit scans the bottom four entries of the queue,
identifying branch instructions and determining what type they are
(conditional or unconditional).
• When the branch unit has enough information to resolve the
branch immediately (an unconditional branch, or a conditional
branch whose condition depends on information already in the
condition register), the branch instruction is simply deleted from
the instruction queue and replaced with the instruction located at the branch target.
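As a rough illustration of branch folding, the sketch below models the instruction queue as a Python list and folds any branch in the bottom four entries that can already be resolved. It is a simplification under stated assumptions, not a description of the 601's actual logic: the `Instr` tuple, the `targets` lookup table, and the `cr` dictionary of already-known condition bits are all hypothetical names introduced here.

```python
from collections import namedtuple

# op is "b" (unconditional), "bc" (conditional), or any other mnemonic.
Instr = namedtuple("Instr", ["op", "target", "cr_bit"])


def plain(op):
    """Build a non-branch instruction (no target, no condition bit)."""
    return Instr(op, None, None)


def fold_branches(queue, targets, cr, scan_depth=4):
    """Return a new queue with resolvable branches folded away.

    queue:   list of Instr, index 0 is the oldest (bottom) entry
    targets: dict mapping a branch target label to the Instr found there
    cr:      dict of condition bits whose values are already known
    """
    folded = []
    for position, instr in enumerate(queue):
        is_branch = instr.op in ("b", "bc")
        if position >= scan_depth or not is_branch:
            folded.append(instr)
            continue
        if instr.op == "b":
            taken = True
        elif instr.cr_bit in cr:
            taken = cr[instr.cr_bit]
        else:
            folded.append(instr)      # condition not known yet: leave the branch
            continue
        if taken:
            # The branch is replaced by the instruction at its target; younger
            # entries belonged to the fall-through path and are dropped.
            folded.append(targets[instr.target])
            break
        # Not taken: the branch simply disappears from the queue.
    return folded
```

In this model a folded branch never reaches the execution core, so it costs no dispatch slot.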
o PowerPC 603:
• smaller die size than the 601
• smaller cache
• can dispatch up to three instructions per cycle, out of order
The 604 and 620 microprocessors followed later in the PowerPC line.
Both aimed for higher performance. The 604 is based on the 32-bit architecture, while the 620 is
a 64-bit design.
Current Status
o PowerPC e200: 32-bit Power Architecture microprocessor with speeds up to 600 MHz;
ideal for embedded applications.
o PowerPC e300: similar to the e200, with speeds up to 667 MHz.
o PowerPC e600: speeds up to 2 GHz; ideal for high-performance routing and
telecommunications applications.
o POWER5 (IBM): dual-core microprocessor.
o POWER6 (IBM): dual-core microprocessor. A notable difference from POWER5 is that POWER6
executes instructions in order instead of out of order.
o PowerPC G3: used in Apple Macintosh computers such as the PowerBook G3, the multicolored
iMacs, iBooks, and several desktops, including both the Beige and the Blue and White Power
Macintosh G3s.
o PowerPC G4: a designation used by Apple Computer for a fourth generation of 32-bit PowerPC microprocessors.
o PowerPC G5: 64-bit Power Architecture processors.
o Xenon: based on IBM's PowerPC ISA; used in the Xbox 360 game console.
o Broadway: based on IBM's PowerPC ISA; used in the Nintendo Wii game console.
o Blue Gene/L: dual-core PowerPC 440, 700 MHz, 2004.
o Blue Gene/P: quad-core PowerPC 450, 850 MHz, 2007.
PowerPC ISA
o A mix between SPARC (RISC) and Motorola (CISC) design styles.
o Different implementation levels (so the chip does not need to
be fully implemented for embedded solutions).
o Load/store architecture: operations are always performed on
registers, and only load and store instructions access memory.
o Offers a large number of mnemonics that increase the number
of assembler-level instructions without increasing the number of
instructions implemented on chip.
o Passes arguments using registers and the stack.
o 32-bit registers allow 4 gigabytes of address space to be addressed.
Overall design
o Integer Execution Unit
o Floating Point Unit
o Load/Store Unit (LSU)
o Branch Execution Units
o Memory Management Unit
o Memory Unit
o Cache
PowerPC Registers
PowerPC's application-level registers are broken into three categories :
general purpose, floating point and special purpose registers.
o General-purpose registers (GPRs): r0 to r31
• Flat scheme of 32 general-purpose registers.
• Source and destination for all integer operations.
• Address source for all load/store operations.
• They also provide access to SPRs.
• All GPRs are available for use with one exception: in certain instructions, GPR0 simply
means the value 0, and no lookup is done for GPR0's contents.
o Some of these registers have special tasks assigned to them:
• r0: volatile register which may be modified during function linkage
• r1: stack frame pointer, always valid
• r2: system-reserved register
• r3-r4: volatile registers used for parameter passing and return values
• r5-r10: volatile registers used for parameter passing
• r11-r12: volatile registers which may be modified during function linkage
• r13: small data area pointer register
• r14-r30: registers used for local variables
• r31: used for local variables or "environment pointers"
Floating point registers
o Floating-point registers (FPRs): f0 to f31 (also written fr0 to fr31)
• 32 floating-point registers with 64-bit precision.
• Source and destination operands of all floating-point operations.
• Can contain 32-bit and 64-bit signed and unsigned integer values, as well as
single-precision and double-precision floating-point values.
• FPRs also provide access to the FPSCR (Floating-Point Status and Control
Register).
• The FPSCR captures status and exceptions resulting from floating-point operations,
and also provides control bits for enabling specific exception types.
• Instructions that load and store double-precision floating-point numbers transfer
64 bits of data without conversion.
• Instructions that load single-precision floating-point numbers from memory convert them
to double-precision format before storing them in the register.
o Floating-point register usage:
• f0: volatile register
• f1: volatile register used for parameter passing and return values
• f2-f8: volatile registers used for parameter passing
• f9-f13: volatile registers
• f14-f31: registers used for local variables
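The single-to-double conversion performed by single-precision loads can be illustrated with Python's `struct` module. This is a minimal sketch, assuming big-endian memory and IEEE 754 formats; the function name `load_single_as_double` is made up for the example and does not correspond to any real API.

```python
import struct


def load_single_as_double(memory_bytes):
    """Model a single-precision load.

    Takes the 4 bytes found in memory (big-endian) and returns the 8-byte
    image that would sit in the 64-bit floating-point register.
    """
    (value,) = struct.unpack(">f", memory_bytes)   # Python floats are IEEE doubles
    return struct.pack(">d", value)
```

For instance, the single-precision encoding of 0.5 (`3f000000`) widens to the double-precision encoding `3fe0000000000000`; the value is unchanged, only the exponent and fraction fields are re-encoded.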
Special-purpose registers (SPRs)
o The Fixed-Point Exception Register (XER): indicates
conditions for integer operations, such as carries and overflows.
o The Floating-Point Status and Control Register (FPSCR): a 32-bit
register that stores the status and control of floating-point
operations.
o The Count Register (CTR): holds a loop count that can be
decremented during the execution of branch instructions.
o The Condition Register (CR): a 32-bit register grouped into eight
fields of 4 bits each. The bits of a field signify the result of an instruction's
operation: Less Than (LT), Greater Than (GT), Equal (EQ), and
Summary Overflow (SO).
o The Link Register (LR): contains the address to return to at the end
of a function call.
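To make the Condition Register layout concrete, the following sketch computes the 4-bit value a signed compare would produce and places it in one of the eight CR fields. The function names are hypothetical; the bit order within a field (LT, GT, EQ, SO from most to least significant) and the placement of field 0 in the most significant four bits follow the PowerPC bit-numbering convention.

```python
# Bit weights inside one 4-bit CR field, most significant first.
LT, GT, EQ, SO = 0b1000, 0b0100, 0b0010, 0b0001


def compare_signed(a, b, xer_so=False):
    """Return the 4-bit result a signed compare of a and b would record."""
    if a < b:
        field = LT
    elif a > b:
        field = GT
    else:
        field = EQ
    if xer_so:                 # SO is copied from the XER summary-overflow bit
        field |= SO
    return field


def set_cr_field(cr, field_index, value):
    """Insert a 4-bit value into field 0..7 of a 32-bit CR image."""
    shift = 4 * (7 - field_index)          # field 0 is the top nibble
    cleared = cr & ~(0xF << shift) & 0xFFFFFFFF
    return cleared | (value << shift)
```

A conditional branch then only needs to name one bit of one field, which is what the BI field of the branch instruction does.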
Data Types
o Either little-endian or big-endian byte ordering can be used.
o Fixed-point data types include:
• Unsigned byte: 8 bits
• Unsigned halfword: 16 bits
• Signed halfword: 16 bits
• Unsigned word: 32 bits
• Signed word: 32 bits
• Unsigned doubleword: 64 bits
• Byte strings: from 0 to 128 bytes in length
o Two's complement is used for negative values.
o Floating-point data formats:
• single-precision: 32 bits (23-bit fraction + 8-bit exponent + 1 sign bit)
• double-precision: 64 bits (52-bit fraction + 11-bit exponent + 1 sign bit)
o Characters are stored using 8-bit ASCII codes.
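The byte-ordering choice and the floating-point field widths listed above can be checked with a short Python sketch using the standard `struct` module. The helper names are illustrative only.

```python
import struct


def word_bytes(value, big_endian=True):
    """Return the 4 bytes of a 32-bit unsigned word in the chosen byte order."""
    return struct.pack(">I" if big_endian else "<I", value)


def split_single(x):
    """Split a value into (sign, exponent, fraction) of its 32-bit format: 1 + 8 + 23 bits."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return raw >> 31, (raw >> 23) & 0xFF, raw & 0x7FFFFF


def split_double(x):
    """Split a value into (sign, exponent, fraction) of its 64-bit format: 1 + 11 + 52 bits."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return raw >> 63, (raw >> 52) & 0x7FF, raw & ((1 << 52) - 1)
```

For example, `word_bytes(0x12345678)` yields the bytes `12 34 56 78` in big-endian mode and `78 56 34 12` in little-endian mode.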
Instruction types
Instruction Format
o All instruction encodings are 32 bits long.
o Bit numbering for PowerPC is the opposite of most other definitions: bit 0 is the
most significant bit, and bit 31 is the least significant bit.
o Instructions are first decoded by the upper 6 bits, a field called the primary
opcode. The remaining 26 bits contain fields for operand specifiers, immediate
operands, and extended opcodes; some of these bits or fields may be reserved.
o Common instruction formats (columns are bit positions; a field with no column of its own extends to the next field shown):

Format    0-5    6-10      11-15     16-20    21-25    26-29           30    31
D-form    opcd   tgt/src   src/tgt   immediate
X-form    opcd   tgt/src   src/tgt   src      extended opcd
A-form    opcd   tgt/src   src/tgt   src      src      extended opcd         Rc
BD-form   opcd   BO        BI        BD                                AA    LK
I-form    opcd   LI                                                    AA    LK
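Because bit 0 is the most significant bit, extracting a field takes a little care. The sketch below shows one way to pull fields out of a 32-bit instruction word using that numbering and applies it to a D-form instruction. The helper names are hypothetical, and treating the immediate as signed is an assumption that holds for `addi` but not for every D-form instruction.

```python
def bits(word, start, end):
    """Extract bits start..end (inclusive) of a 32-bit word, where bit 0 is the MSB."""
    width = end - start + 1
    return (word >> (31 - end)) & ((1 << width) - 1)


def sign_extend(value, width):
    """Interpret a `width`-bit unsigned value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_d_form(word):
    """Split a D-form instruction into its four fields."""
    return {
        "opcd": bits(word, 0, 5),                     # primary opcode
        "rt": bits(word, 6, 10),                      # target (or source) register
        "ra": bits(word, 11, 15),                     # source (or target) register
        "imm": sign_extend(bits(word, 16, 31), 16),   # assumed signed here
    }
```

As a usage example, the word `0x3861FFF8` (`addi r3, r1, -8`) decodes to primary opcode 14, target register 3, source register 1, and immediate -8.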
Instruction format
o D-form: provides up to two registers as source operands, one immediate source, and up to two
registers as target operands. Some variations of this instruction format use portions of the target and
source register operand specifiers as immediate fields or as extended opcodes.
D-form: opcd | tgt/src | src/tgt | immediate
o X-form: provides up to two registers as source operands and up to two target operands. Some variations
of this instruction format use portions of the target and source operand specifiers as immediate fields or
as extended opcodes.
X-form: opcd | tgt/src | src/tgt | src | extended opcd
o A-form: provides up to three registers as source operands and one target operand. Some variations of
this instruction format use portions of the target and source operand specifiers as immediate fields or as
extended opcodes.
A-form: opcd | tgt/src | src/tgt | src | src | extended opcd | Rc
o BD-form: the conditional branch instruction. The BO field specifies the type of condition; the BI field specifies
which CR bit is used as the condition; the BD field is the branch displacement. The AA bit specifies
whether the branch is absolute or relative. The LK bit specifies whether the address of the next
sequential instruction is saved in the Link Register as a return address for a subroutine call.
BD-form: opcd | BO | BI | BD | AA | LK
o I-form: used by the unconditional branch instruction. Being unconditional, the BO and BI fields of the
BD format are exchanged for additional branch displacement to form the LI instruction field. This
instruction format also supports the AA and LK bits in the same fashion as the BD format.
I-form: opcd | LI | AA | LK
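The two branch formats can be decoded with a few lines of Python. This is a sketch, assuming the standard primary opcodes 18 (unconditional branch, I-form) and 16 (conditional branch, BD-form) and that the displacement field is shifted left by two bits before sign extension; the function names are invented for the example.

```python
def bits(word, start, end):
    """Extract bits start..end (inclusive) of a 32-bit word, where bit 0 is the MSB."""
    width = end - start + 1
    return (word >> (31 - end)) & ((1 << width) - 1)


def sign_extend(value, width):
    """Interpret a `width`-bit unsigned value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_branch(word):
    """Decode an I-form or BD-form branch into a dict of its fields."""
    opcd = bits(word, 0, 5)
    aa = bits(word, 30, 30)      # 1 = absolute address, 0 = relative
    lk = bits(word, 31, 31)      # 1 = save return address in LR
    if opcd == 18:               # I-form: 24-bit LI field
        disp = sign_extend(bits(word, 6, 29) << 2, 26)
        return {"form": "I", "disp": disp, "aa": aa, "lk": lk}
    if opcd == 16:               # BD-form: BO, BI, and a 14-bit BD field
        disp = sign_extend(bits(word, 16, 29) << 2, 16)
        return {"form": "BD", "bo": bits(word, 6, 10), "bi": bits(word, 11, 15),
                "disp": disp, "aa": aa, "lk": lk}
    raise ValueError("not an I-form or BD-form branch")
```

The I-form gains its longer reach exactly as described above: the ten bits that BO and BI occupy in the BD-form become part of the displacement.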
o Simplified PowerPC instruction set: http://pds.twi.tudelft.nl/vakken/in1200/labcourse/instruction-set/
Instruction formats (figures omitted): BD-form, D-form, A-form
PowerPC Addressing Modes
o Load/store addressing
• Indirect: the instruction includes a 16-bit displacement that is added to a base register (which may be
any general-purpose register). The base register content can optionally be replaced with the new address.
• Indirect indexed: the instruction references a base register and an index register (both may be
general-purpose registers). The effective address (EA) is the sum of their contents.
o Branch addressing (target address, TA)
• Absolute: TA = actual address
• Relative: TA = current instruction address + displacement (signed; a 24-bit LI or 14-bit BD field
shifted left by two bits)
• Indirect through the Link Register: TA = (LR)
• Indirect through the Count Register: TA = (CTR)
o Arithmetic instructions
• Operands are in registers or are part of the instruction.
• Floating point is register only.
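The address calculations for these modes are simple enough to write out. The sketch below assumes 32-bit addresses, a plain Python list as the register file, and the rule stated earlier in the deck that GPR0 used as a base register means the value 0; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def base_value(gpr, ra):
    """Base register contents, with register 0 read as the constant 0."""
    return 0 if ra == 0 else gpr[ra]


def ea_indirect(gpr, ra, disp):
    """Indirect mode: EA = (base register) + signed 16-bit displacement."""
    return (base_value(gpr, ra) + disp) & MASK32


def ea_indexed(gpr, ra, rb):
    """Indirect indexed mode: EA = (base register) + (index register)."""
    return (base_value(gpr, ra) + gpr[rb]) & MASK32


def branch_target(cia, disp=0, absolute=False, via=None, lr=0, ctr=0):
    """Target address for absolute, relative, LR-indirect, or CTR-indirect branches.

    cia is the address of the branch instruction itself.
    """
    if via == "lr":
        return lr & ~0x3 & MASK32      # low two bits are ignored (word alignment)
    if via == "ctr":
        return ctr & ~0x3 & MASK32
    if absolute:
        return disp & MASK32
    return (cia + disp) & MASK32
```

Masking with `MASK32` models the wrap-around of 32-bit address arithmetic.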
PowerPC function call conventions
o Results from a function call are returned in GPR3 or FPR1, or through a
pointer to a structure passed as the implicit leftmost parameter.
o Any parameters that do not fit into the designated registers are passed on
the stack. In addition, enough space is allocated on the stack to hold all
parameters, whether they are passed in registers or not.
o The PowerPC run-time environment uses a grow-down stack that allocates space for
a function's parameters, linkage information, and local variables.
o The environment uses a single stack pointer without any frame pointer.
To achieve this simplification, the PowerPC stack has a much more rigidly
defined structure.
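A simplified picture of parameter passing is sketched below. It assumes the register ranges given in the register tables of this deck (r3-r10 for integer parameters, f1-f8 for floating-point parameters) and ignores details that real ABIs add, such as structures, 64-bit values in register pairs, or floating-point arguments also consuming general-purpose slots. The function name and the use of Python `int`/`float` to stand in for argument classes are purely illustrative.

```python
INT_ARG_REGS = [f"r{n}" for n in range(3, 11)]   # r3..r10
FP_ARG_REGS = [f"f{n}" for n in range(1, 9)]     # f1..f8


def assign_arguments(args):
    """Map each argument to a register name, or to 'stack' when registers run out."""
    next_int = 0
    next_fp = 0
    locations = []
    for value in args:
        if isinstance(value, float):
            if next_fp < len(FP_ARG_REGS):
                locations.append(FP_ARG_REGS[next_fp])
                next_fp += 1
            else:
                locations.append("stack")
        elif next_int < len(INT_ARG_REGS):
            locations.append(INT_ARG_REGS[next_int])
            next_int += 1
        else:
            locations.append("stack")
    return locations
```

With nine integer arguments, the first eight land in r3 through r10 and the ninth goes to the stack, matching the rule that parameters which do not fit into the designated registers are passed on the stack.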
PowerPC G4e Pipelining
o Seven-stage pipeline
o Superscalar microprocessor: allows multiple instructions to be
executed in parallel.
o Nine execution units:
• BPU: Branch Processing Unit
• VPU: Vector Permute Unit
• VIU: Vector Integer Unit
• VCIU: Vector Complex Integer Unit
• VFPU: Vector Floating Point Unit
• FPU: Floating Point Unit
• IU: Integer Unit
• CIU: Complex Integer Unit
• LSU: Load/Store Unit
PowerPC G4e Pipeline Stages
o Stages 1 and 2 - Instruction Fetch:
• These two stages are both dedicated primarily to grabbing an instruction from
the L1 cache.
• The G4e can fetch four instructions per clock cycle from the L1 cache and send
them on to the next stage.
o Stage 3 - Decode/Dispatch:
• Once an instruction has been fetched, it goes into a 12-entry instruction queue
to be decoded.
• The G4e's decoder can dispatch up to three instructions per clock cycle to the
next stage.
o Stage 4 - Issue: dispatched instructions wait in one of three issue queues.
• The first is the Floating-Point Issue Queue (FIQ), which holds
floating-point (FP) instructions that are waiting to be executed.
• The second is the Vector Issue Queue (VIQ), which holds vector
operations.
• The third is the General Instruction Queue (GIQ), which
holds everything else.
• Once an instruction leaves its issue queue, it goes to the
execution engine to be executed.
o Stage 5 - Execute:
• Instructions can pass out of order from their issue queues
into their respective functional units and be executed.
o Stages 6 and 7 - Complete and Write-Back:
• In these two stages, the instructions are put back into the order
in which they came into the processor, and their results are
written back.
Design principles
o Simplicity favors regularity
• Standard 32-bit instruction format for all instructions
• fixed-length instructions
• register-to-register architecture
• three-operand instruction format
o Smaller is faster
• Three categories of registers, each handling specific instructions, so presumably
faster access time
o Make the common case fast
• Integer and floating-point instructions
o Good design demands good compromises
• To align with RISC principles, many instructions that required three source
operands were eliminated.
• Many complex instructions were removed to conform to RISC principles, but this is
compensated by a large number of mnemonics that increase the number of
assembler-level instructions.
Pros and Cons
o Instruction set
• Approximately 200 machine instructions
• More complex than most RISC machines. For example, floating-point "multiply and add"
instructions take three input operands, and load and store instructions may
automatically update the index register to contain the just-computed target address.
o Pipelined execution
• More sophisticated than SPARC
o Input and output
• Two different modes
• Direct-store segment: maps the virtual address space to an external address space
• Normal virtual memory access
o Permits a range of implementations, from low-cost controllers
through high-performance processors.