Lessons from the ARM Architecture

advertisement
Lessons from the ARM
Architecture
Richard Grisenthwaite
Lead Architect and Fellow
ARM
1
ARM Processor Applications
2
Overview
ƒ Introduction to the ARM architecture
ƒ Definition of “Architecture”
ƒ History & evolution
ƒ Key points of the basic architecture
ƒ Examples of ARM implementations
ƒ Something to the micro-architects interested
ƒ My Lessons on Architecture design
ƒ …or what I wish I’d known 15 years ago
3
Definition of “Architecture”
ƒ
ƒ
The Architecture is the contract between the Hardware and the Software
ƒ
ƒ
The architecture distinguishes between:
ƒ
ƒ
ƒ
ƒ
Confers rights and responsibilities to both the Hardware and the Software
MUCH more than just the instruction set
Architected behaviors:
ƒ Must be obeyed
ƒ May be just the limits of behavior rather than specific behaviors
Implementation specific behaviors – that expose the micro-architecture
ƒ Certain areas are declared implementation specific. E.g.:
ƒ Power-down
ƒ Cache and TLB Lockdown
ƒ Details of the Performance Monitors
Code obeying the architected behaviors is portable across implementations
ƒ
Reliance on implementation specific behaviors gives no such guarantee
Architecture is different from Micro-architecture
ƒ
4
What vs How
History
ƒ ARM has quite a lot of history
ƒ
ƒ
ƒ
First ARM core (ARM1) ran code in April 1985…
ƒ 3 stage pipeline very simple RISC-style processor
Original processor was designed for the Acorn Microcomputer
ƒ Replacing a 6502-based design
ARM Ltd formed in 1990 as an “Intellectual Property” company
ƒ Taking the 3 stage pipeline as the main building block
ƒ This 3 stage pipeline evolved into the ARM7TDMI
ƒ
ƒ
Still the mainstay of ARM’s volume
Code compatibility with ARM7TDMI remains very important
ƒ Especially at the applications level
ƒ The ARM architecture has features which derive from ARM1
ƒ
5
Strong “applications level” compatibility focus in the ARM products
Evolution of the ARM Architecture
ƒ Original ARM architecture:
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
32-bit RISC architecture focussed on core instruction set
16 Registers - 1 being the Program counter – generally accessible
Conditional execution on all instructions
Load/Store Multiple operations - Good for Code Density
Shifts available on data processing and address generation
Original architecture had 26-bit address space
ƒ
ƒ
Augmented by a 32-bit address space early in the evolution
Thumb instruction set was the next big step
ƒ
ƒ
ƒ
6
ARMv4T architecture (ARM7TDMI)
Introduced a 16-bit instruction set alongside the 32-bit instruction set
ƒ Different execution states for different instruction sets
ƒ Switching ISA as part of a branch or exception
ƒ Not a full instruction set – ARM still essential
ARMv4 architecture was still focused on the Core instruction set only
Evolution of the Architecture (2)
ƒ
ARMv5TEJ (ARM926EJ-S) introduced:
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
Better interworking between ARM and Thumb
ƒ Bottom bit of the address used to determine the ISA
DSP-focussed additional instructions
Jazelle-DBX for Java byte code interpretation in hardware
Some architecting of the virtual memory system
ARMv6K (ARM1136JF-S) introduced:
ƒ
ƒ
ƒ
Media processing – SIMD within the integer datapath
Enhanced exception handling
Overhaul of the memory system architecture to be fully architected
ƒ Supported only 1 level of cache
ARMv7 rolled in a number of substantive changes:
ƒ
ƒ
ƒ
ƒ
7
Thumb-2* - variable length instruction set
TrustZone*
Jazelle-RCT
Neon
Extensions to ARMv7
ƒ MPE – Multiprocessing Extensions
ƒ Added Cache and TLB Maintenance Broadcast for efficient MP
ƒ VE - Virtualization Extensions
ƒ Adds hardware support for virtualization:
ƒ 2 stages of translation in the memory system
ƒ New mode and privilege level for holding an Hypervisor
ƒ With associated traps on many system relevant instructions
ƒ Support for interrupt virtualization
ƒ Combines with a System MMU
ƒ LPAE – Large Physical Address Extensions
ƒ Adds ability to address up to 40-bits of physical address space
8
VFP – ARM’s Floating-point solution
ƒ VFP – “Vector Floating-point”
ƒ Vector functionality has been deprecated in favour of Neon
ƒ Described as a “coprocessor”
ƒ Originally a tightly-coupled coprocessor
ƒ Executed instructions from ARM instruction stream via dedicated
interface
Now more tightly integrated into the CPU
ƒ
ƒ Single and Double precision floating-point
ƒ Fully IEEE compliant
ƒ Until VFPv3, implementations required support code for denorms
ƒ Alternative Flush to Zero handling of denorms also supported
ƒ Recent VFP versions:
ƒ VFPv3 – adding more DP registers (32 DP registers)
ƒ VFPv4 – adds Fused MAC and Half-precision support (IEEE754-2008)
9
ARM Architecture versions and products
ƒ
Key architecture revisions and products:
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
ƒ
10
ARMv1-ARMv3:
ARMv4T:
ARMv5TEJ(+VFPv2):
ARMv6K(+VFPv2):
largely lost in the mists of time
ARM7TDMI – first Thumb processor
ARM926EJ-S
ARM1136JF-S, ARM1176JFZ-S,
ARM11MPCore – first Multiprocessing Core
ARMv7-A+VFPv3
Cortex-A8
ARMv7-A+MPE+VFPv3:
Cortex-A5, Cortex-A9
ARMv7-A+MPE+VE+LPAE+VFPv4
Cortex-A15
ARMv7-R :
ARMv6-M
ARMv7-M:
Cortex-R4, Cortex-R5
Cortex–M0
Cortex-M3, Cortex-M4
ARMv7TDMI
ƒ
Simple 3 stage pipeline
ƒ
ƒ
ƒ
Fetch, Decode, Execute
Multiple cycles in execute stage
for Loads/Stores
Simple core
ƒ
11
“Roll your own memory system”
ARM926EJ-S
ƒ
5 stage pipeline single issue core
ƒ Fetch, Decode, Execute, Memory, Writeback
ƒ Most common instructions take 1 cycle in each pipeline stage
ƒ Split Instruction/Data Level1 caches Virtually tagged
ƒ MMU – hardware page table walk based
Java Decode Java Decode
Stack
Management
Register
Read
Thumb Decode
Instruction
Fetch
Instruction
Stream
FETCH
12
Register Register
Decode Read
Compute
Partial Products
Sum/Accumulate
& Saturation
Shift + ALU
Memory Access
Register
Write
ARM Decode
Register Register
Decode Read
DECODE
EXECUTE
MEMORY WRITEBACK
ARM1176JZF-S
ƒ 8 stage pipeline single issue
ƒ
Split Instruction/Data Level1 caches Physically tagged
ƒ
ƒ
ƒ
Two cycle memory latency
MMU – hardware page table walk based
Hardware branch prediction
PF1
PF2
I-Cache Access
+
Dynamic
Branch Prediction
DE
Decode
+
StaticB
P
RStack
ISS
Instr
Issue
+
Regist
er
Read
SH
ALU
SAT
ALU and MAC Pipeline
WB
ex
MAC1 MAC2 MAC3
LSU Pipeline
LS
add
13
DC1
DC2
WB
LS
Cortex-A8
ƒ Dual Issue, in-order
ƒ 10 stage pipeline (+ Neon Engine)
ƒ 2 levels of cache – L1 I/D split, L2 unified
ƒ Aggressive Branch Prediction
F0
F1
F2
D0
D1
D2
D3
D4
E1
E0
Branch mispredict penalty
E2
E3
E4
E5
Instruction Execute and Load/Store
M0 M1 M2
M3
NEON
N1 N2 N3 N4 N5
NEON register writeback
Integer register writeback
Replay penalty
Integer ALU pipe
ALU pipe 0
Integer MUL pipe
MUL pipe 0
Instruction
Fetch
NEON
Instruction
Decode
ALU pipe 0
Instruction Decode
Integer shift pipe
Non-IEEE FP ADD pipe
LS pipe 0 or 1
Non-IEEE FP MUL pipe
IEEE FP engine
L1 instruction cache miss
L1 data
L1 data cache miss
L2 data
Load and store
data queue
L2
L3
L2 Tag Array
L4
L5
L6
LS permute pipe
NEON store data
BIU pipeline
L1
N6
L7
L8
L2 Data Array
L9
Embedded Trace Macrocell
T0 T1 T2 T3 T4 T5
T6 T7 T8 T9 T10 T11 T12 T13
External trace port
L3 memory system
14
Cortex-A9
ƒ
ƒ
Dual Issue, out-of-order core
MP capable – delivered as clusters of 1 to 4 CPUs
ƒ
ƒ
ƒ
15
MESI based coherency scheme across L1 data caches
Shared L2 cache (PL310)
Integrated interrupt controller
Cortex-A15 – Just Announced - Core Detail
ƒ
2.5Ghz in 28 HP process
ƒ 12 stage in-order, 3-12 stage OoO pipeline
ƒ 3.5 DMIPS/Mhz ~ 8750 DMIPS @ 2.5GHz
ƒ
ƒ
ARMv7A with 40-bit PA
Dynamic repartitioning Virtualization
ƒ
ƒ
ƒ
Fast state save and restore
Move execution between cores/clusters
128-bit AMBA 4 ACE bus
ƒ Supports system coherency
ƒ
64-bit/128bit AMBA4 interface
ECC on L1 and L2 caches
Eagle Pipeline
Simple Cluster 0
12 Stage In-order pipeline
(Fetch, 3 Instruction
Decode, Rename)
16
Rename
Rename
Decode
Decode
Fetch
Fetch
Simple Cluster 1
Multiply Accumulate
Complex
Complex
Load & Store 0
Load & Store 1
3-12 Stage
out-of-order
pipeline
(capable of 8
issue)
Just some basic lessons from experience
ƒ
Architecture is part art, part science
ƒ
ƒ
ƒ
There is no one right way to solve any particular problem
ƒ ….Though there are plenty of wrong ways
Decision involves balancing many different criteria
ƒ Some quantitative, some more subjective
Weighting those criteria is inherently subjective
ƒ
Inevitably architectures have fingerprints of their architects
ƒ
Hennessy & Patterson quantitative approach is excellent
ƒ
ƒ
Computer science – we experiment using benchmarks/apps
ƒ
ƒ
But it is only a framework for analysis – a great toolkit
…but the set of benchmarks/killer applications is not constant
Engineering is all about technical compromise, and balancing factors
ƒ
17
The art of Good Enough
First Lesson – It’s all about Compatibility
ƒ
ƒ
Customers absolutely expect compatibility
ƒ
ƒ
Customers buy your roadmap, not just your products
The software they write is just expected to work
ƒ Trying to obsolete features is a long-term task
Issues from the real world:
ƒ
ƒ
ƒ
18
Nobody actually really knows what their code uses
ƒ …and they’ve often lost the sources/knowledge
People don’t use features as the architect intended
ƒ Bright software engineers come up with clever solutions
ƒ The clever solutions find inconvenient truths
Compatibility is with what the hardware actually does
ƒ Not with how you wanted it to be used
ƒ There is a thing called “de facto architecture”
Second lesson – orthogonality
ƒ
Classic computer texts tell you orthogonality is good
ƒ
Beware false orthogonalities
ƒ
ARM architecture R15 being the program counter
ƒ Orthogonality says you can do lots of wacky things using the PC
ƒ On a simple implementation, the apparent orthogonality is cheap
ƒ
ARM architecture has “shifts with all data processing”
ƒ Orthogonality from original ARM1 pipeline
ƒ But the behaviour has to be maintained into the future
Not all useful control configurations come in powers of 2
ƒ
ƒ
Fear the words “It just falls out of the design”
ƒ
ƒ
19
True for today’s microarchitecture – but what about the next 20 years?
Try to only architect the functionality you think will actually be useful
ƒ Avoid less useful functionality that emerges from the micro-architecture
Third Lesson – Microarchitecture led features
ƒ
Few successful architectures started as architectures
ƒ
ƒ
ƒ
Most architectures have “micro-architecture led features”
ƒ
ƒ
ƒ
ƒ
Code runs on implementations, not on architectures
People buy implementations, not architectures
ƒ …IP licensing notwithstanding
“it just fell out”
Optimisations based on first target micro-architecture
ƒ MIPS – delayed branch slots
ƒ ARM – the PC offset of 2 instructions
Made the first implementation cheaper/easier than the pure approach
ƒ …but becomes more expensive on subsequent implementations
Surprisingly difficult trade-off
ƒ
20
Short-term/long-term balance
ƒ Meeting a real need sustainably vs overburdening early implementations
Fourth Lesson: New Features
ƒ Successful architectures get pulled by market forces
ƒ
Success in a particular market adds features for that market
ƒ
ƒ
ƒ
Solutions don’t necessarily fit together perfectly
Unsuccessful architectures don’t get same pressures
ƒ
ƒ
Different points of success pull successively over time
…which is probably why they appear so clean!
Be very careful adding new features:
ƒ
ƒ
21
Easy to add, difficult to remove
ƒ Especially for user code
“Trap and Emulate” is an illusion of compatibility
ƒ Performance differential is too great for most applications
Lessons on New Features
ƒ
New features rarely stay in the application space you expect….
ƒ
ƒ
ƒ
If a feature requires a combination of hardware and specific software…
ƒ
ƒ
ƒ
…be afraid – development timescales are different
Be very afraid of language specific features
ƒ All new languages appear to be different…
ƒ ….but very rarely are
Avoid solving this years problem in next year’s architecture
ƒ
ƒ
ƒ
22
…Or want – architects are depressingly powerless
ƒ Customers will exploit whatever they find
So worry about the general applicability of that feature
ƒ Assume it has to go everywhere in the roadmap
... Next year’s problem may be very different
Point solutions often become warts – all architectures have them
If the feature has a shelf-life, plan for obsolence
ƒ Example: Jazelle-DBX in the ARM architecture
Thoughts on instruction set design
ƒ
What is the difference between an instruction and a micro-op?
ƒ
ƒ
ƒ
Splitting instructions easier than fusing instructions
ƒ
ƒ
ƒ
ƒ
RISC principles said they were the same
ƒ Very few PURE RISC architectures exist today
ƒ ARM doesn’t pretend to be “hard-core” RISC
ƒ …I’ll claim some RISC credentials
Choice of Micro-ops is micro-architecture dependent
ƒ An architecture should be micro-architecturally independent
ƒ Therefore mapping of instructions to micro-ops is inherently “risky”
If an instruction can plausibly be done in one block, might be right to express it
ƒ Even if some micro-architectures are forced to split the instruction
But, remember legacy lasts for a very long time
ƒ Imagine explaining the instructions in 5 years time
Avoid having instructions that provide 2 ways to do much the same thing
ƒ
Everyone will ask you which is better - lots of times…
If it feels a bit clunky when you first design it…
ƒ
23
…. it won’t improve over time
Final point – Architecture is not enough
ƒ
Not Enough to ensure perfect “write once, run everywhere”
ƒ
Customers expectations of compatibility go beyond architected behaviour
ƒ
Beware significant performance anomalies between architectural compliant cores
ƒ
Multi-processing introduce huge scope for functional differences from timing
ƒ
BUT THE ARCHITECTURE IS A STRATEGIC ASSET
ƒ People don’t write always code to the architecture
ƒ …and they certainly can’t easily test it to the architecture
ƒ ARM is developing tools to help address this
ƒ Architecture Envelope Models – a suite of badly behaved legal implementations
ƒ The architecture defines whether they are software bugs or hardware incompatibilities
ƒ …allows you to assign blame (and fix the problem consistently)
ƒ If I buy a faster core, I want it to go reliably faster…without recompiling
ƒ Especially in badly synchronised code
ƒ Concurrency errors
24
History – the Passage of Time
25
Microprocessor Forum 1992
26
Count the Architectures (11)
ARM
MIPS
29K
N32x16
PA
88xxx
NVAX Alpha
SPARC
x86
x86
x86 i960 x86
27
The Survivors – At Most 2
ARM
MIPS
SPARC
x86
28
What Changed?
ƒ Some business/market effects
ƒ Some simple scale of economy effects
ƒ Some technology effects
ƒ Adoption and Ecosystem effects
ƒ It wasn’t all technology – not all of the disappearing
architectures were bad
ƒ Not all survivors are good!
ƒ Not all good science succeeds in the market!
ƒ Never forget “Good enough”
29
Thank you
Questions
30
Download