Revolver: Processor Architecture for
Power Efficient Loop Execution
Mitchell Hayenga, Vignyan Reddy and Mikko H. Lipasti
- Padmini Gaur (13IS15F)
- Sanchi (13IS20F)
Contents
• The Need
• Approaches and Issues
• Revolver: Some basics
• Loop Handling
▫ Loop Detection
 Detection and Training
 Finite State Machine
• Loop Execution
• Scheduler
▫ Units
▫ Tag Propagation Unit
• Loop pre-execution
• Conclusion
• References
The Need
• Diminishing per-transistor energy improvements
• Increasing computational efficiency
▫ Power-efficient mobile and server processors
▫ Increasing energy constraints
• Elimination of unnecessary pipeline activity
• Managing energy utilization
▫ Instruction execution itself needs little energy, but the control overheads around it are large
So far: Approaches and Issues
• Pipeline-centric instruction caching
▫ Emphasizing temporal instruction locality
• Capturing loop instructions in a buffer
▫ Inexpensive retrieval on future iterations
• Out-of-order processors: Issues?
▫ Resource allocation
▫ Program ordering
▫ Operand dependency
[Figure: Instructions serviced by the loop buffer — from "Power Efficient Loop Execution Techniques", Mitchell Bryan Hayenga]
[Figure: Energy consumption — from "Power Efficient Loop Execution Techniques", Mitchell Bryan Hayenga]
Revolver: An enhanced approach
• Out-of-order back-end
• Overall design similar to a conventional processor
• Non-loop instructions
▫ Follow the normal pipeline path
• No Register Alias Table (RAT) in the front-end; instead, a Tag Propagation Unit in the back-end
• Loop mode:
▫ Loops detected and dispatched to the back-end
The promises
• No additional resource allocation
• Energy consumption at the front-end managed
• Pre-execution of future iterations
• Operand dependence linking moved to the back-end
Loop handling
• Loop detection
• Training feedback
• Loop execution
▫ Wakeup logic
▫ Tag Propagation Unit
• Load Pre-execution
Loop Detection
• Detection at two stages:
▫ Post-execution
▫ At the decode stage
• Enabling loop mode at decode
• Calculation of:
▫ Start address
▫ Required resources
Detection and Training
• Key mechanisms:
▫ Detection logic at the front-end -> loops dispatched
▫ Feedback from the back-end: profitability of loops
• Profitability
▫ Unprofitable loops have future loop mode disabled
• Detection control
▫ Loop Detection Finite State Machine
FSM
FSM states
• Idle: instructions flow through decode until a valid, profitable loop or a PC-relative backward branch/jump is detected
• Profitability logged in Loop Address Table
• LAT records:
▫ Composition and profitability
• A loop known to be profitable is dispatched
• Backward jump/branch with no known loop
▫ Train state
• Train state:
▫ Records start address
▫ End address
▫ Allowable unroll factor
• Resources required added to LAT
• Loop ends -> Idle state
• In the Dispatch state, the decode logic guides the dispatch of loop instructions into the out-of-order back-end
• Loop mode disabled on:
▫ System calls
▫ Memory barriers
▫ Load-linked/store-conditional pairs
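The FSM described above can be sketched in software. A minimal C++ sketch of the Idle/Train/Dispatch states, assuming a simplified Loop Address Table keyed by the loop's start PC (structure names, fields and the exit check are illustrative, not the hardware's actual design):

#include <cstdint>
#include <unordered_map>

// Sketch of the Idle / Train / Dispatch loop-detection FSM.
enum class LoopState { Idle, Train, Dispatch };

struct LATEntry {                 // Loop Address Table entry (illustrative)
    uint64_t start_pc = 0;
    uint64_t end_pc   = 0;
    int      insts    = 0;        // loop composition (instruction count)
    int      profit   = 8;        // profitability counter, default 8
};

struct LoopDetector {
    LoopState state = LoopState::Idle;
    std::unordered_map<uint64_t, LATEntry> lat;   // Loop Address Table
    uint64_t cur_start = 0;
    int      cur_insts = 0;

    // Called at decode for every instruction.
    void decode(uint64_t pc, bool backward_branch, uint64_t target) {
        switch (state) {
        case LoopState::Idle:
            if (backward_branch) {
                auto it = lat.find(target);
                cur_start = target;
                if (it != lat.end() && it->second.profit >= 8) {
                    state = LoopState::Dispatch;   // known profitable loop
                } else {                           // unknown loop: learn it
                    cur_insts = 0;
                    state = LoopState::Train;
                }
            }
            break;
        case LoopState::Train:
            ++cur_insts;                           // record loop composition
            if (backward_branch && target == cur_start) {
                LATEntry e;                        // resources added to LAT
                e.start_pc = cur_start;
                e.end_pc   = pc;
                e.insts    = cur_insts;
                lat[cur_start] = e;
                state = LoopState::Idle;           // loop ended -> Idle
            }
            break;
        case LoopState::Dispatch:
            // Decode steers loop instructions into the out-of-order back-end.
            // Leaving the loop (or a system call / memory barrier) disables
            // loop mode and returns the FSM to Idle.
            if (!backward_branch && pc > lat[cur_start].end_pc)
                state = LoopState::Idle;
            break;
        }
    }
};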
Training Feedback
• Profitability
▫ 4-bit counter
▫ Default value = 8
▫ Loop mode enabled if value >= 8
▫ If a dispatched loop unrolls more than twice: +2
▫ Otherwise: -2
▫ On a misprediction other than the fall-through exit: profitability set to 0
• Disabled loops:
▫ The front-end increments the counter by 1 for every 2 sequential successful dispatches
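A rough software analogue of this counter, assuming a 4-bit saturating range of 0-15 (the saturation bounds are an assumption; the default of 8, the +2/-2 updates, the reset to 0 and the +1 per 2 successes come from the slide):

#include <algorithm>

// Illustrative 4-bit saturating profitability counter (0..15), default 8.
struct Profitability {
    int value     = 8;            // default value = 8
    int successes = 0;            // sequential successes while disabled

    bool loop_mode_enabled() const { return value >= 8; }

    // Back-end feedback after a loop is dispatched in loop mode.
    void on_dispatch(int unroll_factor) {
        if (unroll_factor > 2) value = std::min(15, value + 2);
        else                   value = std::max(0,  value - 2);
    }

    // Any misprediction other than the loop's fall-through exit.
    void on_bad_misprediction() { value = 0; }

    // Disabled loops: +1 for every 2 sequential successful dispatches.
    void on_sequential_success() {
        if (!loop_mode_enabled() && ++successes == 2) {
            value = std::min(15, value + 1);
            successes = 0;
        }
    }
};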
Loop: Basic idea
• Unrolling the loop:
▫ As much as possible, depending on back-end resources (sketched below)
▫ Eliminating additional resource use after dispatch
• Loop instructions stay in the issue queue and execute repeatedly until all iterations complete
• Resources provided at dispatch are maintained across multiple executions
• Load-store queues modified to maintain program order
Contd..
• Proper access to destination and source registers
• Loop exit:
▫ Loop instructions removed from the back-end
▫ The fall-through path dispatched
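A minimal sketch of how the unroll factor could be bounded by back-end resources; the specific resource set and the min-based formula are assumptions, since the slides only say the loop is unrolled as much as the back-end allows:

#include <algorithm>

// Illustrative bound: every unrolled copy of the loop body must fit in
// the free issue-queue, load-queue and store-queue entries at dispatch.
struct LoopBody  { int insts; int loads; int stores; };
struct FreeSlots { int issue_queue; int load_queue; int store_queue; };

int allowable_unroll(const LoopBody& b, const FreeSlots& f) {
    int u = f.issue_queue / std::max(1, b.insts);
    if (b.loads  > 0) u = std::min(u, f.load_queue  / b.loads);
    if (b.stores > 0) u = std::min(u, f.store_queue / b.stores);
    return u;   // 0 means the loop does not fit; loop mode is not used
}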
Loop execution: Let’s follow
• Green: odd-numbered iterations
• Blue: even-numbered iterations
• Pointers:
▫ Program-order maintenance: loop_start, loop_end
▫ Oldest uncommitted entry: commit
Loop execution, contd..
• Committing:
▫ From start to end
▫ Wrapping back to start: next loop iteration
▫ Issue queue entries reset for the next loop iteration
▫ Load queue entries invalidated
▫ Store queue entries:
 Passed to the write buffer
 Immediately reused in the next iteration
 If they cannot be written to the buffer -> stall (very rare)
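A small sketch of the wrapping commit pointer over the loop's issue-queue entries (entry contents and the reset details are placeholders):

#include <vector>

// Illustrative commit walk: the commit pointer advances from loop_start
// to loop_end, then wraps back to loop_start for the next iteration,
// resetting each issue-queue entry as it commits.
struct IQEntry { bool executed = false; };

struct LoopCommit {
    std::vector<IQEntry> iq;      // issue-queue entries holding the loop
    int loop_start = 0;           // program-order bounds of the loop
    int loop_end   = 0;
    int commit     = 0;           // oldest uncommitted entry

    // Commit at most one iteration's worth of entries per call.
    void try_commit() {
        int budget = loop_end - loop_start + 1;
        while (budget-- > 0 && iq[commit].executed) {
            iq[commit].executed = false;   // reset entry for next iteration
            commit = (commit == loop_end) ? loop_start : commit + 1;
        }
    }
};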
Scheduler: Units
• Wake-up array
▫ Identifies ready instructions
• Select logic
▫ Arbitrates among ready instructions
• Instruction silo
▫ Produces the opcode and physical register identifiers of the selected instruction
Scheduler: The design
Scheduler: The concept
• Managed as a queue
• Maintains program order among entries
• Wakeup array
▫ Uses logical register identifiers
▫ Position-dependent
• Tag Propagation Unit (TPU)
▫ Provides the physical register mapping
Wakeup Logic: Overview
• Observes generated results:
▫ Identifies new instructions that are ready to execute
• Program-order based
• Broadcasts logical register identifiers
▫ No renaming needed
▫ No physical register identifiers used
Wakeup: The design
Wakeup array
• Rows: Instructions
• Columns: Logical registers
• Signals:
▫ Request
▫ Granted
▫ Ready
Wakeup operation
• Allocation into the wakeup array
▫ Marks the logical source and destination registers
• An unscheduled instruction
▫ Deasserts its downstream destination register column
▫ Preventing younger, dependent instructions from waking up
• A request is sent when:
▫ All necessary source register broadcasts have been received
▫ All source registers are ready
• When select grants the request:
▫ The downstream ready line is asserted
▫ Waking up younger dependent instructions
• Wakeup logic cell:
▫ 2 state bits: sources / produces the logical register (sketched after the example below)
The simple logic
An example with dependence
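The dependence example can be mimicked in software. A position-indexed wakeup-array sketch using the two state bits per cell described above; the nested scan is a software stand-in for the combinational downstream logic, and the register count is an assumption:

#include <array>
#include <vector>

// Illustrative wakeup array: rows are instructions in program order,
// columns are logical registers. Each cell carries two state bits:
// the instruction "sources" and/or "produces" that logical register.
constexpr int kLogicalRegs = 32;

struct Row {
    std::array<bool, kLogicalRegs> sources{};    // reads this register
    std::array<bool, kLogicalRegs> produces{};   // writes this register
    bool issued = false;
};

struct WakeupArray {
    std::vector<Row> rows;                       // program order, oldest first

    // An instruction requests issue when every source column is "ready"
    // at its position, i.e. not deasserted by an older, unissued producer.
    bool requests(int i) const {
        if (rows[i].issued) return false;
        for (int r = 0; r < kLogicalRegs; ++r) {
            if (!rows[i].sources[r]) continue;
            for (int j = i - 1; j >= 0; --j) {   // nearest older producer
                if (rows[j].produces[r]) {
                    if (!rows[j].issued) return false;  // column deasserted
                    break;                        // producer already granted
                }
            }
            // No older producer in the window: the value comes from
            // outside the loop and is assumed ready.
        }
        return true;
    }

    // Select grants one request; asserting the downstream ready line is
    // modeled here simply by marking the producer as issued.
    void grant(int i) { rows[i].issued = true; }
};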
Tag Propagation Unit (TPU)
• No renaming!
• Maps physical register identifiers to logical registers
• Enables reuse of physical registers
▫ As no additional resources are allocated
▫ Handles physical register management
• Enables speculative execution of future loop iterations
Next loop iteration??
• Impossible if:
 The instruction only has access to a single physical destination register
• Speculative execution:
▫ An alternative physical register identifier is needed
• Solution: 2 physical destination registers
▫ Writes alternate between the 2
With 2 destination registers
• Double buffering
▫ Maintains the previous state while speculatively computing the next
▫ When iteration N+1 commits, the destination register of iteration N is reused by iteration N+2
▫ Once N+1 has committed, no instruction can still depend on iteration N's value
▫ Speculative writes to the output register are then allowed
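A minimal sketch of the double-buffering rule: each loop instruction alternates its physical destination by iteration parity, and iteration N's register may be reused by N+2 once N+1 has committed (the types and names are illustrative):

#include <cstdint>

// Illustrative double buffering: each loop instruction owns two physical
// destination registers and alternates between them by iteration parity.
struct DestPair {
    uint16_t preg[2];             // the two physical destination registers

    // Physical destination used when executing iteration n.
    uint16_t dest_for(uint64_t n) const { return preg[n & 1]; }

    // Iteration n+2 reuses iteration n's register; this is safe once
    // iteration n+1 has committed, since no consumer can still need n's value.
    bool reuse_safe(uint64_t n, uint64_t last_committed) const {
        return last_committed >= n + 1;
    }
};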
With Double buffering
• Dynamic linkage between dependent instructions and their source registers
• Changing the logical register mapping
▫ Overwrites the output register column
• Instructions are stored in program order:
▫ Downstream instructions obtain the proper source mapping
Source, destination and iteration
Register reclamation
• On any instruction misprediction:
▫ Downstream instructions are flushed
▫ Mappings are propagated to all newly scheduled instructions
• Better than a RAT:
▫ Reduced complexity
Queue entries: Lifetime
• Allocated prior to dispatch
• Retained until the instruction exits the back-end
• Reused to execute multiple loop iterations
▫ LSQ entries freed immediately upon commit
▫ Position-based age logic in the LSQ
• Load queue entries:
▫ Simply reset for future iterations
Store Queue entries: An extra effort
• Need to be written back
• Drained immediately into a write buffer between the store queue and the L1 cache
• If they cannot be written -> stall
▫ Very rare
• The commit pointer then wraps around
Loop pre-execution
• Pre-execution of future loads:
▫ Parallelization
▫ Enables zero-latency loads
 No L1 cache access latency
• Loads are executed repeatedly until all iterations complete
• Exploits the recurrent nature of loops:
▫ Highly predictable address patterns
Learning from example: String copy
• Copying a source array to a destination array
• Predictable load addresses
• Accessing consecutive bytes from memory
• Primary addressing access patterns:
▫ Stride
▫ Constant
▫ Pointer-based
• Simple pattern-identification hardware placed alongside the pre-executed load buffers
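The string-copy loop itself is ordinary C++; what makes it amenable to pre-execution is that its load address advances by a fixed one-byte stride every iteration:

#include <cstddef>

// String copy: each iteration loads src[i] and stores dst[i].
// The load address advances by exactly 1 byte per iteration, so the
// pre-executed load buffer can predict and fetch future bytes early.
void string_copy(char* dst, const char* src) {
    std::size_t i = 0;
    do {
        dst[i] = src[i];          // stride-1 load, stride-1 store
    } while (src[i++] != '\0');   // copies the terminator, then stops
}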
Stride based addressing
• Most common pattern
• Iterating over a data array
▫ Compute the address Δ between 2 consecutive loads
▫ If the third load matches the predicted stride: stride verified
▫ Pre-execute the next load
• Constant: a special case of zero-sized stride
▫ Reads from the same address
▫ Stack-allocated variables / pointer aliasing
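A sketch of a stride/constant detector along these lines: the Δ between the first two addresses is the candidate stride, the third address verifies it, and the next address can then be pre-executed (a zero Δ is the constant case). The structure and field names are illustrative:

#include <cstdint>
#include <optional>

// Illustrative stride detector for one load in the loop body.
struct StrideDetector {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    int      seen      = 0;       // addresses observed so far
    bool     verified  = false;

    // Observe this load's address for the current iteration.
    void observe(uint64_t addr) {
        if (seen == 1)      stride   = int64_t(addr - last_addr);   // candidate Δ
        else if (seen >= 2) verified = (int64_t(addr - last_addr) == stride);
        last_addr = addr;
        ++seen;
    }

    // Predicted address of the next iteration's load (constant = stride 0).
    std::optional<uint64_t> predict_next() const {
        if (!verified) return std::nullopt;
        return last_addr + stride;
    }
};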
Pointer based addressing
• The value returned by the current load -> the next load address
• E.g. linked-list traversals
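The pointer-chasing case in code; the linked-list layout is only an example, and the "prediction" is simply the value the current load returned:

#include <cstdint>

// Pointer-based pattern: the value returned by the current load is the
// address of the next load, e.g. walking a singly linked list.
struct Node { int value; Node* next; };

int sum_list(const Node* head) {
    int sum = 0;
    // Each iteration's load of p->next produces the address used by the
    // next iteration's loads: a pointer-chasing access pattern.
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;
    return sum;
}

// For pre-execution, the predicted next address is simply the value the
// current load returned, interpreted as a pointer.
inline uint64_t predict_next_pointer(uint64_t loaded_value) {
    return loaded_value;
}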
Pre-execution: more..
• Pre-executed load buffer placed between the load queue and the L1 cache interface
• A store that clashes with a pre-executed load
▫ Invalidates the entry
▫ Maintains coherence
• Pre-executed loads:
▫ Speculatively wake up dependent operations in the next cycle
• On an incorrect address prediction:
▫ The scheduler cancels and re-issues the operation
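A sketch of the pre-executed load buffer behavior described on this slide: entries hold a predicted address and value, a clashing store invalidates the entry, and a lookup miss means the scheduler must cancel the speculatively woken consumers and re-issue the load (buffer organization and matching granularity are assumptions):

#include <cstdint>
#include <vector>

// Illustrative pre-executed load buffer sitting between the load queue
// and the L1 cache interface.
struct PreExecEntry {
    uint64_t addr;
    uint64_t value;
    bool     valid;
};

struct PreExecBuffer {
    std::vector<PreExecEntry> entries;

    // A store to a matching address invalidates the entry (coherence).
    void on_store(uint64_t addr) {
        for (auto& e : entries)
            if (e.valid && e.addr == addr) e.valid = false;
    }

    // When the real load issues: a hit returns the value with zero latency;
    // a miss (or a wrong predicted address) means the scheduler cancels the
    // speculatively woken consumers and re-issues the load to the L1.
    bool lookup(uint64_t addr, uint64_t& value) const {
        for (const auto& e : entries)
            if (e.valid && e.addr == addr) { value = e.value; return true; }
        return false;
    }
};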
Conclusion
• Minimizes energy during loop execution
• Eliminates front-end overheads arising from pipeline activity and resource allocation
• Achieves greater benefits than loop buffers and μop caches
• Pre-execution improves performance during loop execution by hiding L1 cache latencies
• Reported 5.3-18.3% energy-delay benefit
References
• J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, M. Irwin, "Scheduling Reusable Instructions for Power Reduction," 2004
• P. G. Sassone, J. Rupley, E. Breckelbaum, G. H. Loh, B. Black, "Matrix Scheduler Reloaded"
• L. H. Lee, B. Moyer, J. Arends, "Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops"
• Mitchell Bryan Hayenga, "Power Efficient Loop Execution Techniques"