2013-DCE-Naveen-Kumar-Keynote

advertisement

A Case for Hybrid

Microarchitectures

Naveen Kumar

Intel Corporation

Santa Clara, CA

Jan 22, 2013

Disclaimer

The following comments and opinions do not reflect

Intel’s official position and should be construed solely as Naveen Kumar’s personal opinions. Technical data in this presentation has been previously disclosed in academic and industrial venues.

Hybrid Microarchitecture

• HW/SW co-designed, based upon the realization

– Hardware is good at some tasks

– Software is good at certain other tasks

– Co-design to leverage best of both worlds SW (DBT)

• Central tenets of Hybrid design HW

– Simplify hardware

– Move complexity into software

Hybrid

Microprocessor

• Commercial realization

– Transmeta Crusoe and Efficeon

– VLIW front-end, in-order backend and a DBT layer

– Continuous Dynamic Compilation and Adaptation

– Not the only possible solution …

ISA Public: x86

ISA Private:

RISC

Outline

• Transmeta Efficeon as background on Hybrid uarch

• Recent research work

– Assertion to improve scheduling [ASPLOS 2010]

– BlockChop to improve MP performance [ISCA 2012]

– Co-designed power optimizations [Under submission]

• Microprocessor design challenges

• Conclusion

Transmeta Efficeon Hardware

• VLIW front-end and an in-order backend

– Upto 6 issue VLIW, RISC like instructions

– 2 loads or stores

– 2 integer ALU

– 2 SIMD

– 1 branch/call/other control

• Support for hardware atomicity

– Speculative bit in register files and cachelines, commit/rollbk ops

– Co-designed software leverages atomicity for aggressive optimizations

• Alias hardware

• What it did not have

– x86 decoders

– OOO backend (MOB, ROB)

Efficeon’s Code Morphing Software

• Dynamic Binary Translation and Optimization

– Speculation: Translator makes aggressive assumptions about code to achieve higher performance

– Recovery: Check assumptions and rollback to commit points if they prove to be false, for precise interpretation

– Adaptive retranslation: If recovery is required too often, retranslate with less aggressive assumptions

CMS Gearing System

1 st Gear

Executes 1 instruction at a time

• Profiles code at runtime

• Gathers data for flow analysis

• Gathers branch frequencies and directions

• Detects load/store typing (IO vs memory)

Filters out infrequently executed code

No startup cost

Lowest speed

CMS Gearing System

2 nd Gear

Uses profile information to create initial translation

• Translates a “Region” of up to100 x86 instructions.

• Adds flow graph “Shape” information

• Light Optimization

• “Greedy” scheduling

Low translation overhead

Fast execution

CMS Gearing System

3 rd Gear

Further optimizes gear 2 regions

• Common sub-expression elimination

• Memory re-ordering

• Significant code optimization

• Critical path scheduling

Medium translation overhead

Faster execution

CMS Gearing System

4 th Gear

Most advanced optimizations for hottest regions

• Splices together multiple regions

• Optimizes across region boundaries

• Critical path scheduling

Highest translation overhead

Fastest execution

Performance Comparison

(From Sept 2004 Microprocessor Forum)

Bringing DBT Mainstream

Transmeta Efficeon: Summary

• Interesting vision and implementation

• Fit for the time when it was launched

– 90 nm tech, now we are at 22 and going lower

– Low power is today’s norm

– Multicore, more ISA functionality, security, privacy, FT…

• More recently…

Some Recent Research Work

• A real system evaluation of hardware atomicity for software speculation [ASPLOS 2010]

• BlockChop: Dynamic Squash Elimination for Hybrid Processor

Architecture [ISCA 2012]

• Component Level Power Gating [Under Submission]

• Reducing Startup Time in Co-Designed Virtual Machines [ISCA 2006]

• LAR-CC: Large atomic regions with conditional commits [CGO 2011]

• A HW/SW co-designed heterogeneous multi-core virtual machine for energy-efficient general purpose computing [CGO 2011]

• …

A real system evaluation of hardware atomicity for software speculation [ASPLOS 2010]

Example from Vortex

Atomic Region Abstraction

Atomic Region Benefits

• Dependence graph shown

• Atomic region trivially enables optimizer to eliminate operations

– 88 -> 73 Efficeon operations

– 7% reduction

• Relaxes scheduling constraints

– 26 -> 19 cycles

– 27% reduction

BlockChop: Dynamic Squash Elimination for Hybrid

Processor Architecture [ISCA 2012]

Shared Data Conflicts in Hybrid

Multicore

• Multiple atomic blocks may access and modify the same data block

• Violates atomicity assumption

• Results in a squash

• Frequent squashes kill performance  Atomicity hurts!!

BlockChop

• Introduce BlockChop for shared data block conflicts

• BlockChop can leverage interpreter, translator, or region cache to respond to squashes

• Makes decisions on how to respond with a squash handling mechanism (SHM)

BlockChop Squash Handling

• Adaptive chopper performs remarkably well

Component Level Power Gating [Under Submission]

SIMD Instruction Usage in VPR

Intervals of execution

• Resource usage (SIMD unit) is variable: power-gate opportunity

SIMD Instruction Usage in VPR

Intervals of execution

• Resource usage (SIMD unit) is variable: power-gate opportunity

– An ideal predictor (red) can turn off VPU at fine granularity – but at what power cost of its own?

SIMD Instruction Usage in VPR

Intervals of execution

• Resource usage (SIMD unit) is variable: power-gate opportunity

– An ideal predictor (red) can turn off VPU at fine granularity – but at what power cost of its own?

– A predictor sampling at coarse granularity (blue) is sub-optimal

– How to get to the red curve?

Power Gating in the BT System

• Profile execution to determine when a component is used

• Insert explicit (speculative) instructions to turn off or on a component

– No need for a complex hardware predictor

– Can still achieve near optimal power savings

• Speculation failure leads to recovery and adaptive retranslation (i.e., remove the instruction)

Recent Research: Summary

Lot of interest in hybrid microprocessors

 What microprocessor design challenges do these address?

Microprocessor Design Challenges

Performance and Power Still Key

• Performance per watt important for smartphones & tablets to exascale computers

• Much of recent work on Hybrid architectures in this area

• Better design tradeoffs:

 Promising but needs to be proven: a lot more work required

Security and Privacy

• Traditionally, microprocessors have not been built with security and privacy in mind

– No longer feasible to continue designing in that manner

– Rise of sophisticated cybercrime for commercial (criminal) profits and govt. spying

– Malware in every computing form factor: how long before they access your digital wallet on the phone

• Hybrid architecture permits a middleware to diagnose, prevent and react to attacks

– A microarchitecture that can inspect code and sandbox data

– Flexibly update the protections

 How much better than an OS or a VMM?

Diversity of Market Segments

• The number of form factors for computing devices has grown

– Each with a different requirement and a microarchitecture

– Different design goals imply different tradeoffs

– Industry heavyweights would like to consolidate

– Also brings down prices

 Can we create a malleable architecture to span all segments?

What are the requirements and challenges?

Shorter Time to Market

• Design cycle of 4-5 years now considered too long

– Esp. when there are vendors who roll out a chip every two years

– Possible with simpler baseline, separate arch/uarch/fabrication designs

 Also sub-optimal; co-designed arch with uarch and with knowledge of fabrication technology is more efficient

 Ends up being a slower design cycle due to the complexity

• Hybrid architecture permits continuing “microarchitectural”

(software) design late into the development phase

– Overlaps development phases traditionally not possible

 But what about co-designing hardware and software together?

Legacy Problem

• Intel x86 has to deal with enormous amount of legacy ISA features

– Even micro-architectural legacy is hard to overcome in traditional designs (too much complexity to easily revisit past solutions)

– But legacy is a part of life: demand for new features

– ARM 32-bit/64-bit

• Binary translation is a known mechanism to deal with this problem

– Apple Rosetta (PPC  x86)

– Houdini on Intel phones (ARM  x86)

 Hybrid architecture as a problem solver for legacy ISA?

– Or does it replace one (micro-architectural) legacy with another

(software)?

Validation

• A multi-year effort

– Development cycle is long

– Lack of abstractions lead to a lot of bugs being found during endto-end validation

– Reuse of components do help

• Hybrid architecture separates out a large piece of complexity

– Hardware is simpler  easier to validate

– Software uses high-level language abstraction (its not RTL)

– Parallelize validation of HW and SW components

 Are we addressing the validation challenge or making it even bigger?

Conclusion

• Hybrid microarchitecture is a promising area of research

– Combines a simple hardware and complex software system

– Continuous dynamic compilation and adaptation

• Appears promising in addressing microprocessor design challenges

– A number of open questions

– Academia can help us answer those

• Fodder for future thought

– Is Hybrid micro-architecture better suited to solving reliability, fault tolerance, determinism, extracting parallelism problems?

Download