ARM Architecture



Charles Bock

Arm Flavors

Cortex-A – Application (Fully Featured)

Android/iPhone

Windows RT Tablets

Cortex-R – Real Time (RTOS)

Cars

Routers

Infrastructure

Cortex-M – Embedded (Minimal)

Automation

Appliances

ULP Devices

I will focus on Cortex-A15

Most Featured

Most Complex

Most Interesting

Cortex-A15 Overview

ARM processor architecture supports the 32-bit ARM and 16-bit Thumb ISAs

Superscalar, variable-length, out-of-order pipeline.

Dynamic branch prediction with Branch Target Buffer (BTB) and Global History Buffer (GHB)

Two separate 32-entry fully-associative Level 1 (L1) Translation Lookaside Buffers (TLBs)

4-way set-associative 512-entry Level 2 (L2) TLB in each processor

Fixed 32KB L1 instruction and data caches.

Shared L2 cache of up to 4MB

40-bit physical addressing (1TB)
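
As a quick check of the 1TB figure, here is a minimal Python sketch of the arithmetic. The variable names are illustrative and not taken from the slides; the 32-bit comparison is added only for scale.

```python
# A 40-bit physical address can name 2**40 distinct bytes.
PHYS_ADDR_BITS = 40

addressable_bytes = 2 ** PHYS_ADDR_BITS
addressable_tib = addressable_bytes // 2 ** 40  # 1 TiB = 2**40 bytes
addressable_gib = addressable_bytes // 2 ** 30  # 1 GiB = 2**30 bytes

# For comparison, a plain 32-bit physical address reaches only 4 GiB.
gib_with_32_bits = 2 ** 32 // 2 ** 30

print(addressable_tib, "TiB =", addressable_gib, "GiB")
print("32-bit addressing:", gib_with_32_bits, "GiB")
```

The extra 8 address bits multiply the reachable physical memory by 256.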

Instruction Set

RISC (ARM – Advanced RISC Machine)

Fixed instruction width of 32 bits for easy decoding and pipelining, at the cost of decreased code density.

Additional modes or states allow additional instruction sets

Thumb (16-bit)

Thumb-2 (16- and 32-bit)

Jazelle (bytecode)

Trade-off: 32-bit ARM vs. 16-bit Thumb

Thumb

Thumb is a 16-bit instruction set

Improved code density (and performance where memory bandwidth is the limit), with more assumed operands.

Subset of the functionality of the ARM instruction set

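Instruction Encoding

[Figure: instruction encoding of ADD. First layout, fields: Always, ADD, Op 1, Destination, Op 2 (annotated "Wasted!"). Second layout, fields: ADD, Operands, Destination.]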

Jazelle

Jazelle DBX technology for direct Java bytecode execution

Direct interpretation of bytecode to machine code

General Layout

Fetch

Decode

Dispatch

Execute

Load/Store

WriteBack

Block Diagram 1

FP / SIMD Depth 18-24

Integer Depth 15 (Same as recent Intel Cores)

Instructions broken down into Sub Operations here

This is Genius

Register Renaming SIMD

Block Diagram 2

Fetch

Up to 128 bits per fetch depending on alignment

ARM Set: 4 Instructions (32 bit)

Thumb Set: 8 Instructions (16 bit)

Only 3 can be dispatched per cycle.

Support for unaligned fetch address.

Branch prediction begins in parallel with fetch.
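
The fetch numbers above follow from simple division. A minimal sketch, assuming an aligned 128-bit fetch window and (as a simplification not stated in the slides) one dispatch slot per instruction, ignoring any split into sub-operations:

```python
FETCH_BITS = 128     # maximum bits delivered by one aligned fetch
DISPATCH_WIDTH = 3   # instructions dispatched per cycle

def instructions_per_fetch(instr_bits):
    """How many fixed-width instructions fit in one fetch window."""
    return FETCH_BITS // instr_bits

def dispatch_cycles(num_instructions):
    """Cycles needed to dispatch a group, rounding up (ceiling division)."""
    return -(-num_instructions // DISPATCH_WIDTH)

arm_per_fetch = instructions_per_fetch(32)    # 32-bit ARM encoding
thumb_per_fetch = instructions_per_fetch(16)  # 16-bit Thumb encoding

print("ARM:", arm_per_fetch, "per fetch,", dispatch_cycles(arm_per_fetch), "cycles to dispatch")
print("Thumb:", thumb_per_fetch, "per fetch,", dispatch_cycles(thumb_per_fetch), "cycles to dispatch")
```

Under these assumptions a single fetch can supply more instructions than one cycle of dispatch consumes, so fetch is not the bottleneck for straight-line code.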

Branch Prediction - Global History Buffer

Global History Buffer

3 arrays: Taken array, Not taken array, and Selector

Branch Prediction - microBTB

microBTB

Reduces bubble on taken branches

64-entry, fully associative, for fast-turnaround prediction

Caches taken branches only

Overruled by main predictor if they disagree

Branch Prediction - Indirect

Indirect Predictor

256-entry BTB indexed by XOR of history and address

The XOR allows multiple target addresses to be indexed per branch
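
A minimal sketch of why XOR indexing lets one branch own several targets, assuming the index mixes the branch address with a global history value. The class name, the `history` parameter, and the word-aligned shift are hypothetical choices for illustration.

```python
class IndirectPredictor:
    """Toy 256-entry target table indexed by (address XOR history)."""

    ENTRIES = 256

    def __init__(self):
        self.targets = [None] * self.ENTRIES

    def _index(self, pc, history):
        # Drop the two alignment bits, mix in history, fold into the table.
        return ((pc >> 2) ^ history) % self.ENTRIES

    def predict(self, pc, history):
        """Return the remembered target, or None if nothing is stored."""
        return self.targets[self._index(pc, history)]

    def update(self, pc, history, target):
        self.targets[self._index(pc, history)] = target

# One indirect branch (for example a switch or virtual call) at 0x4000
# that goes to different places depending on the path taken to reach it.
ip = IndirectPredictor()
ip.update(0x4000, history=0b0101, target=0x8000)
ip.update(0x4000, history=0b1010, target=0x9000)

print(hex(ip.predict(0x4000, 0b0101)), hex(ip.predict(0x4000, 0b1010)))
```

Because the history differs, the same branch address lands in two different entries, so both targets stay resident at once.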

Branch Prediction – Return Stack

Return Address Stack

8-32 entries deep

Most indirect jumps (85%) are returns from functions

Push on call

Pop on return
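
The push-on-call, pop-on-return behavior can be sketched as a small bounded stack. The class name, the default depth of 8, and the drop-oldest overflow policy are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Toy return address stack with a fixed depth."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def push(self, return_address):
        """Called on a function call: remember where to come back to."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: the oldest entry is lost
        self.entries.append(return_address)

    def pop(self):
        """Called on a return: predict the most recent call site."""
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack()
ras.push(0x100)  # outer call
ras.push(0x200)  # nested call
first_return = ras.pop()   # inner function returns first
second_return = ras.pop()  # then the outer one
print(hex(first_return), hex(second_return))
```

Call depth beyond the stack depth loses the oldest return addresses, so very deep recursion mispredicts on the way back out.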

Branch Prediction - Misc

Deeper Pipeline = Larger mispredict penalty

Static predictor: always predicts true (taken) if the branch is not known

Decode / Out of order Issue

Instructions are Decoded into discrete sub operations

Multiple Issue Queues (8)

Instructions dispatched 3 per cycle to the appropriate issue queue

The instruction dispatch unit controls when the decoded instructions can be dispatched to the execution pipelines and when the returned results can be retired

Register Renaming

RRT (Register Rename Table)

Maps each register in use to an available physical register

Rename Loop

Queue which stores available registers for use

Registers removed when in use

Registers re-added when retired from use

13 General Purpose Registers R0-R12

R13 = Stack Pointer

R14 = Return Address (Function Calls)

R15 = Program Counter
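
The RRT and rename loop described above can be sketched as a mapping table plus a free-register queue. This is a toy model: the class name, register counts, and the stall-by-exception behavior are hypothetical, and real hardware tracks far more state.

```python
from collections import deque

class Renamer:
    """Toy register renamer: a rename table plus a queue of free registers."""

    def __init__(self, num_arch=16, num_phys=32):
        # Initially architectural register i lives in physical register i.
        self.rrt = {arch: arch for arch in range(num_arch)}
        # The rename loop holds every physical register not currently mapped.
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dest, sources):
        """Rename one instruction; returns (new_dest, phys_sources, old_dest)."""
        # Sources are read through the table before the destination changes.
        phys_sources = [self.rrt[src] for src in sources]
        if not self.free:
            raise RuntimeError("no free physical register: dispatch would stall")
        new_phys = self.free.popleft()   # removed from the queue while in use
        old_phys = self.rrt[dest]
        self.rrt[dest] = new_phys
        return new_phys, phys_sources, old_phys

    def retire(self, old_phys):
        """The previous mapping is dead once the instruction retires."""
        self.free.append(old_phys)       # re-added to the queue

renamer = Renamer(num_arch=4, num_phys=6)
first = renamer.rename(dest=1, sources=[1, 2])   # r1 = r1 op r2
second = renamer.rename(dest=1, sources=[1])     # r1 = op r1
print(first, second)
```

The second instruction reads the physical register that the first one wrote, while each write to r1 gets a fresh physical register, which is what removes false write-after-write and write-after-read dependencies.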

Loop Buffer / Loop Cache

32 Entries Long

Can contain up to two “forward” and one “backward” branch

Completely shuts down fetch and large parts of decode stages.

Why? Saves power, Saves time.

Smart!

Execution Lanes

Integer Lane

Single cycle integer operations

2 ALUs, 2 shifters

FPU / SIMD (NEON) Lane

Asymmetric, varying length: 2-10 cycles

Branch Lane

Any operation that targets the PC for writeback, usually 1 cycle

Mult / Div Lane

All Mult/Div operations, 4 cycles.

Load / Store Lane

Cache / Mem access 4 cycles.

Cache maintenance

1 load and 1 store per cycle

Load cannot bypass store, store cannot bypass store

Load Store Pipeline

Issue queue 16 deep

Out of order but cannot bypass stores (safe)

Stores in order but only require address to issue

Pipeline

Address Generation Unit (AGU) / TLB lookup

Address and Tag Setup

Data / Tag Access

Data selection and forwarding
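
The issue-queue ordering rules above (loads out of order, but nothing bypasses an older store) can be sketched as a selection function. The `MemOp` type and `issuable` helper are hypothetical, and the model is deliberately conservative: it treats a store as blocking younger operations until it is marked issued.

```python
from dataclasses import dataclass

@dataclass
class MemOp:
    kind: str             # "load" or "store"
    ready: bool           # address has been generated
    issued: bool = False  # already sent down the pipeline

def issuable(queue):
    """Indices (in program order) of operations allowed to issue now."""
    result = []
    older_store_pending = False
    for i, op in enumerate(queue):
        if op.issued:
            continue
        # Nothing may bypass an older store that has not issued yet.
        if op.ready and not older_store_pending:
            result.append(i)
        if op.kind == "store":
            older_store_pending = True
    return result

queue = [
    MemOp("load", ready=False),   # waiting on its address
    MemOp("load", ready=True),    # may bypass the older load
    MemOp("store", ready=True),   # oldest unissued store, may issue
    MemOp("load", ready=True),    # must wait behind the store
]
candidates = issuable(queue)
print(candidates)
```

A real scheduler would then pick at most one load and one store per cycle from the candidates.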

L1 Instruction / Data Caches

32KB 2-way set-associative cache.

64-byte blocks, so 256 sets * 2 ways * 64 bytes = 32KB

Physically-Indexed and Physically-Tagged (PIPT).

Strictly enforced write-through (important for cache consistency!)
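
The cache geometry above fixes how a physical address splits into tag, set index, and byte offset. A minimal sketch of that arithmetic; the helper name `split_address` and the sample address are illustrative only.

```python
CACHE_BYTES = 32 * 1024
LINE_BYTES = 64
WAYS = 2

num_sets = CACHE_BYTES // (LINE_BYTES * WAYS)
offset_bits = LINE_BYTES.bit_length() - 1  # log2(64)
index_bits = num_sets.bit_length() - 1     # log2(number of sets)

def split_address(addr):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(num_sets, "sets,", offset_bits, "offset bits,", index_bits, "index bits")
print(split_address(0x12345))  # (4, 141, 5)
```

With 6 offset bits and 8 index bits, two addresses collide in the same set whenever they differ only above bit 13, and the 2 ways let two such lines coexist.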

L2 Shared Cache

16-way set-associative, up to 4MB

4 tag banks to handle parallel requests

All Snooping is done at this level to keep caches consistent.

If a core is powered down, its L1 cache can be restored from L2.

Any “Read Clean” Requests on the bus can be serviced by L2.

Supports Automatic Prefetching for Streaming Data Loads

Dual Layer TLB Structure

Layer One:

Two separate 32-entry fully associative L1 TLBs for data load and store pipelines.

Layer Two:

4-way set-associative 512-entry L2 TLB in each processor

In General:

The TLB entries contain a global indicator or an Address Space Identifier (ASID) to permit context switches without TLB flushes.

The TLB entries contain a Virtual Machine Identifier (VMID) to permit virtual machine switches without TLB flushes.

Miss:

Trade-off: add more hardware for faster TLB miss handling, or let the OS handle it in software?

The CPU includes a full hardware table-walk machine in case of a TLB miss; no OS involvement is required.
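
The ASID and VMID tagging described above can be sketched as extra fields in the TLB lookup key. This toy model uses a dictionary rather than a set-associative array; the class and method names are hypothetical.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with VMID and ASID (or global)."""

    def __init__(self):
        # Key: (vmid, asid or None for global entries, virtual page number)
        self.entries = {}

    def insert(self, vmid, asid, vpn, ppn, is_global=False):
        key = (vmid, None if is_global else asid, vpn)
        self.entries[key] = ppn

    def lookup(self, vmid, asid, vpn):
        """Return the physical page number, or None on a TLB miss."""
        hit = self.entries.get((vmid, asid, vpn))
        if hit is None:
            # Global entries match every ASID within the same VM.
            hit = self.entries.get((vmid, None, vpn))
        return hit

tlb = TaggedTLB()
tlb.insert(vmid=0, asid=1, vpn=0x10, ppn=0xA)   # process 1
tlb.insert(vmid=0, asid=2, vpn=0x10, ppn=0xB)   # process 2, same virtual page
tlb.insert(vmid=0, asid=0, vpn=0x20, ppn=0xC, is_global=True)  # e.g. kernel page

# A context switch only changes the current ASID; nothing is flushed.
print(hex(tlb.lookup(0, 1, 0x10)), hex(tlb.lookup(0, 2, 0x10)))
```

Both processes keep their translations for the same virtual page resident at once, and a lookup from another virtual machine simply misses instead of seeing stale entries.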

big.LITTLE

Combine A15 with A7.

Interconnect Below The L2 Shared Cache

