Charles Bock
Cortex-A – Application (Fully Featured)
Android / iPhone
Windows RT Tablets
Cortex-R – Real-Time (RTOS)
Cars
Routers
Infrastructure
Cortex-M – Embedded (Minimal)
Automation
Appliances
ULP Devices
I will focus on Cortex-A15
Most Featured
Most Complex
Most Interesting
The ARM processor architecture supports the 32-bit ARM and 16-bit Thumb ISAs
Superscalar, variable-length, out-of-order pipeline.
Dynamic branch prediction with a Branch Target Buffer (BTB) and a Global History Buffer (GHB)
Two separate 32-entry fully-associative Level 1 (L1) Translation Lookaside Buffers (TLBs)
4-way set-associative 512-entry Level 2 (L2) TLB in each processor
Fixed 32KB L1 instruction and data caches.
Shared L2 cache of 4MB
40-bit physical addressing (1 TB)
RISC (ARM – Advanced RISC Machine)
Fixed instruction width of 32 bits for easy decoding and pipelining, at the cost of decreased code density.
Additional modes or states allow additional instruction sets:
Thumb (16-bit)
Thumb-2 (16- and 32-bit)
Jazelle (bytecode)
Trade-off: 32-bit ARM vs. 16-bit Thumb
Thumb is a 16-bit instruction set
Improved code density (and performance on narrow memory); more operands are implicit (assumed)
Subset of the functionality of the ARM instruction set
Encoding diagram: the 32-bit ARM ADD encodes Op 1, Op 2, and a separate Destination field; when the destination is always the same as Op 1, those bits are wasted. The 16-bit Thumb ADD drops the separate field and reuses Op 1 as the Destination.
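The trade-off can be illustrated with two hypothetical encoders. The field widths and opcodes below are invented for illustration (they are NOT the real ARM/Thumb bit layouts); the point is only that the 16-bit form has no room for a separate destination field.

```python
# Hypothetical encodings (opcodes and field widths made up for
# illustration; not the real ARM/Thumb bit layouts).

def arm_add(rd, rn, rm):
    """32-bit style ADD: separate 4-bit destination, op1, op2 fields."""
    OPCODE = 0b0100  # made-up opcode
    return (OPCODE << 12) | (rd << 8) | (rn << 4) | rm

def thumb_add(rdn, rm):
    """16-bit style ADD: rdn is both first operand and destination."""
    OPCODE = 0b0001100  # made-up opcode
    return (OPCODE << 6) | (rdn << 3) | rm

# thumb_add(1, 2) means r1 = r1 + r2; an "ADD r3, r1, r2" with a
# distinct destination needs the wider 32-bit form.
```

Halving the instruction size buys code density at the cost of forcing the destination to equal the first operand.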
Jazelle DBX technology for direct Java bytecode execution
Direct hardware interpretation of bytecode to machine code
Fetch
Decode
Dispatch
Execute
Load/Store
WriteBack
FP / SIMD pipeline depth: 18-24 stages
Integer pipeline depth: 15 stages (same as recent Intel cores)
Instructions are broken down into sub-operations here
This is Genius
Register Renaming SIMD
Up to 128 bits per fetch depending on alignment
ARM Set: 4 Instructions (32 bit)
Thumb Set: 8 Instructions (16 bit)
Only 3 can be dispatched per cycle.
Support for unaligned fetch addresses.
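Since a 128-bit fetch delivers 4 ARM or 8 Thumb instructions per cycle but only 3 can be dispatched, fetch outruns dispatch. A toy model (the queue capacity and fetch-then-dispatch ordering are assumptions, not A15 details) shows the decode queue filling:

```python
# Toy model: a decode queue fed by wide fetch and drained by 3-wide
# dispatch. Capacity 32 and the update order are illustrative only.
def queue_depth_after(cycles, fetch_w, dispatch_w=3, cap=32):
    depth = 0
    for _ in range(cycles):
        depth = min(cap, depth + fetch_w)   # fetch fills the queue
        depth = max(0, depth - dispatch_w)  # dispatch drains 3/cycle
    return depth

# ARM fetch (4/cycle): queue grows by 1 entry per cycle.
# Thumb fetch (8/cycle): queue hits capacity, so fetch can idle.
```

The net surplus is what lets the front end absorb fetch stalls without starving dispatch.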
Branch prediction begins in parallel with fetch.
Global History Buffer
3 arrays: Taken array, Not taken array, and Selector
microBTB
Reduces the bubble on taken branches
64-entry, fully associative, for fast-turnaround prediction
Caches taken branches only
Overruled by the main predictor if they disagree
Indirect Predictor
256-entry BTB indexed by an XOR of the branch address and branch history
XOR indexing allows multiple target addresses per branch
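A sketch of why the XOR index helps: folding recent history into the index (the exact hash is an assumption here) gives one indirect branch several table slots, one per recent path, so per-path targets can coexist.

```python
# Illustrative XOR-indexed indirect-branch target buffer.
ENTRIES = 256  # matches the 256-entry figure above

def btb_index(pc, history):
    # Same branch, different history -> different slot.
    return (pc ^ history) & (ENTRIES - 1)

btb = {}

def update(pc, history, target):
    btb[btb_index(pc, history)] = target

def predict(pc, history):
    return btb.get(btb_index(pc, history))  # None on a miss
```

With a plain PC index, the second target of a polymorphic call would evict the first; the hashed index keeps both.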
Return Address Stack
8-32 entries deep
Most indirect jumps (~85%) are returns from functions
Push on call
Pop on return
Deeper Pipeline = Larger mispredict penalty
Static predictor: always predicts taken if the branch is not yet known
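The taken-array / not-taken-array / selector arrangement above resembles a bi-mode direction predictor. A minimal sketch, assuming table sizes, 2-bit counters, a PC^history index, and a simplified chooser policy (none of these details are A15-specific):

```python
# Bi-mode style sketch: two biased counter arrays plus a selector.
SIZE = 1024                 # assumed table size
taken_arr = [2] * SIZE      # 2-bit counters biased toward taken
not_taken_arr = [1] * SIZE  # 2-bit counters biased toward not-taken
selector = [1] * SIZE       # chooser: >= 2 selects the taken array

def _idx(pc, ghr):
    return (pc ^ ghr) % SIZE  # assumed hash of PC and global history

def predict(pc, ghr):
    arr = taken_arr if selector[pc % SIZE] >= 2 else not_taken_arr
    return arr[_idx(pc, ghr)] >= 2

def train(pc, ghr, outcome):
    i = _idx(pc, ghr)
    chosen = taken_arr if selector[pc % SIZE] >= 2 else not_taken_arr
    correct = (chosen[i] >= 2) == outcome
    # Move the chosen counter toward the outcome.
    chosen[i] = min(3, chosen[i] + 1) if outcome else max(0, chosen[i] - 1)
    # Simplified policy: nudge the chooser only on a misprediction.
    if not correct:
        s = selector[pc % SIZE]
        selector[pc % SIZE] = max(0, s - 1) if s >= 2 else min(3, s + 1)
```

Keeping taken-biased and not-taken-biased branches in separate arrays reduces destructive aliasing between them.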
Instructions are Decoded into discrete sub operations
Multiple Issue Queues (8)
Instructions dispatched 3 per cycle to the appropriate issue queue
The instruction dispatch unit controls when the decoded instructions can be dispatched to the execution pipelines and when the returned results can be retired
RRT (Register rename Table)
Maps architectural registers in use to available physical registers
Rename Loop
A queue that stores the physical registers available for use
Registers are removed when in use
Registers are re-added when retired from use
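The rename loop above can be sketched as a free list plus a rename table. The sizes (16 architectural, 64 physical registers) are illustrative assumptions:

```python
from collections import deque

class Renamer:
    """Sketch of the RRT + free-list rename loop (sizes assumed)."""
    def __init__(self, n_arch=16, n_phys=64):
        self.rrt = list(range(n_arch))            # arch reg -> phys reg
        self.free = deque(range(n_arch, n_phys))  # registers available for use

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a written destination."""
        old_phys = self.rrt[arch_reg]
        new_phys = self.free.popleft()  # removed from the free list when in use
        self.rrt[arch_reg] = new_phys
        return new_phys, old_phys       # old mapping is freed at retirement

    def retire(self, old_phys):
        self.free.append(old_phys)      # re-added when retired from use
```

Each write to the same architectural register gets a distinct physical register, which is what removes the false (WAW/WAR) dependences that block out-of-order issue.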
13 General Purpose Registers R0-R12
R13 = Stack Pointer
R14 = Return Address (Function Calls)
R15 = Program Counter
Loop buffer: 32 entries long
Can contain up to two “forward” and one “backward” branch
Completely shuts down fetch and large parts of the decode stages.
Why? Saves power, Saves time.
Smart!
Integer Lane
Single cycle integer operations
2 ALUs, 2 shifters
FPU / SIMD (NEON) Lane
Asymmetric, varying length, 2-10 cycles
Branch Lane
Any operation that targets the PC for writeback, usually 1 cycle
Mult / Div Lane
All Mult/Div operations, 4 cycles.
Load / Store Lane
Cache / Mem access 4 cycles.
Cache maintenance
1 load and 1 store per cycle
Load cannot bypass store, store cannot bypass store
Issue queue 16 deep
Out of order, but cannot bypass stores (safe)
Stores issue in order, but only require the address to issue
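The ordering rule above can be sketched as an issue check over the queue. This is an assumed, simplified model (real load/store queues also track addresses and forwarding), capturing only the constraint that nothing bypasses an older unissued store:

```python
# Simplified issue rule: loads reorder freely among themselves, but
# neither a load nor a store may bypass an older, not-yet-issued store.
def can_issue(queue, i):
    older = queue[:i]
    return all(e["issued"] for e in older if e["kind"] == "store")

queue = [
    {"kind": "store", "issued": False},
    {"kind": "load",  "issued": False},
    {"kind": "load",  "issued": False},
]
```

Blocking loads behind stores is conservative but safe: the load can never read stale data that an older store was about to overwrite.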
Pipeline
AGU Address generation Unit / TLB Lookup
Address and Tag Setup
Data / Tag Access
Data selection and forwarding
32KB 2-way set-associative cache.
64-byte lines, so 256 sets × 2 ways × 64 B = 32KB
Physically-Indexed and Physically-Tagged (PIPT).
Strictly enforced write-through (important for cache consistency!)
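The geometry above fixes how a physical address splits for a lookup: 64-byte lines give 6 offset bits, 256 sets give 8 index bits, and the rest is tag. A worked check:

```python
# Address split for a 32 KB, 2-way, 64 B-line PIPT cache.
LINE_BYTES = 64
WAYS = 2
CACHE_BYTES = 32 * 1024
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 256 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6
INDEX_BITS = SETS.bit_length() - 1          # 8

def split(paddr):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = paddr & (LINE_BYTES - 1)
    index = (paddr >> OFFSET_BITS) & (SETS - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Because the cache is PIPT, both the index and the tag come from the physical address, so no aliasing between virtual mappings is possible.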
16-way set-associative, 4MB
4 tag banks to handle parallel requests
All Snooping is done at this level to keep caches consistent.
If a core is powered down its L1 cache can be restored from L2.
Any “Read Clean” Requests on the bus can be serviced by L2.
Supports Automatic Prefetching for Streaming Data Loads
Layer One:
Two separate 32-entry fully associative L1 TLBs for data load and store pipelines.
Layer Two:
4-way set-associative 512-entry L2 TLB in each processor
In General:
The TLB entries contain a global indicator or an Address Space Identifier (ASID) to permit context switches without TLB flushes.
The TLB entries contain a Virtual Machine Identifier (VMID) to permit virtual machine switches without TLB flushes.
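A sketch of how those tags avoid flushes (field names and the lookup loop are illustrative, inferred from the description above): a hit requires a VMID match, plus either a global entry or an ASID match.

```python
class TlbEntry:
    """Illustrative tagged TLB entry: VPN -> PPN plus ASID/VMID tags."""
    def __init__(self, vpn, ppn, asid, vmid, is_global=False):
        self.vpn, self.ppn = vpn, ppn
        self.asid, self.vmid = asid, vmid
        self.is_global = is_global  # global entries match any ASID

def lookup(tlb, vpn, asid, vmid):
    for e in tlb:  # fully associative: compare every entry
        if e.vpn == vpn and e.vmid == vmid and (e.is_global or e.asid == asid):
            return e.ppn
    return None  # miss: the hardware table walker takes over
```

Entries for different processes (ASIDs) or guests (VMIDs) coexist in the same TLB, so a context or VM switch just changes the current tags instead of invalidating everything.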
Miss:
Trade-off: add more hardware for faster TLB-miss handling, or let the OS handle it in software?
The CPU includes a full hardware table-walk machine in case of a TLB miss; no OS involvement is required.
big.LITTLE: combine the A15 with the Cortex-A7.
Interconnect below the shared L2 cache