Naveen Kumar
Intel Corporation
Santa Clara, CA
Jan 22, 2013
The following comments and opinions do not reflect Intel’s official position and should be construed solely as Naveen Kumar’s personal opinions. Technical data in this presentation has been previously disclosed in academic and industrial venues.
• HW/SW co-designed, based upon the realization
– Hardware is good at some tasks
– Software is good at certain other tasks
– Co-design to leverage the best of both worlds
• Central tenets of Hybrid design
– Simplify hardware
– Move complexity into software
[Figure: Hybrid microprocessor, with a SW (DBT) layer on top of the HW]
• Commercial realization
– Transmeta Crusoe and Efficeon
– VLIW front-end, in-order backend and a DBT layer
– Continuous Dynamic Compilation and Adaptation
– Not the only possible solution …
[Figure labels: public ISA is x86; private ISA is RISC]
• Transmeta Efficeon as background on Hybrid uarch
• Recent research work
– Assertion to improve scheduling [ASPLOS 2010]
– BlockChop to improve MP performance [ISCA 2012]
– Co-designed power optimizations [Under submission]
• Microprocessor design challenges
• Conclusion
• VLIW front-end and an in-order backend
– Up to 6-issue VLIW, RISC-like instructions
– 2 loads or stores
– 2 integer ALU
– 2 SIMD
– 1 branch/call/other control
• Support for hardware atomicity
– Speculative bit in register files and cache lines, commit/rollback ops
– Co-designed software leverages atomicity for aggressive optimizations
• Alias hardware
• What it did not have
– x86 decoders
– OOO backend (MOB, ROB)
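As a rough illustration of the commit/rollback model listed under hardware atomicity, the Python sketch below models a register file that keeps a committed copy alongside its working state. The class and method names are hypothetical, and this is a software analogy under simplifying assumptions, not a description of Efficeon's actual circuitry.

```python
class SpeculativeRegisterFile:
    """Toy model: working registers plus a committed (shadow) copy."""

    def __init__(self, num_regs=8):
        self.working = [0] * num_regs      # state seen by speculative code
        self.committed = [0] * num_regs    # state at the last commit point
        self.dirty = set()                 # registers written since last commit

    def write(self, reg, value):
        self.working[reg] = value
        self.dirty.add(reg)                # loosely analogous to a speculative bit

    def read(self, reg):
        return self.working[reg]

    def commit(self):
        """Make all speculative writes part of the committed state."""
        for reg in self.dirty:
            self.committed[reg] = self.working[reg]
        self.dirty.clear()

    def rollback(self):
        """Discard speculative writes and restore the last commit point."""
        for reg in self.dirty:
            self.working[reg] = self.committed[reg]
        self.dirty.clear()


rf = SpeculativeRegisterFile()
rf.write(0, 10)
rf.commit()        # r0 = 10 becomes the commit point
rf.write(0, 99)    # speculative update inside an atomic region
rf.rollback()      # an assumption failed, so r0 returns to 10
```

Tracking only the dirty registers keeps both commit and rollback proportional to the number of speculative writes, which is the property the co-designed software relies on when it optimizes aggressively.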
• Dynamic Binary Translation and Optimization
– Speculation: Translator makes aggressive assumptions about code to achieve higher performance
– Recovery: Check assumptions and rollback to commit points if they prove to be false, for precise interpretation
– Adaptive retranslation: If recovery is required too often, retranslate with less aggressive assumptions
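The speculate, recover, and retranslate loop above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `Translation` class, the numeric aggressiveness `level`, and the `max_failures` threshold are all hypothetical names introduced for illustration and do not come from the presentation.

```python
class Translation:
    """Hypothetical translated region with a tunable aggressiveness level."""

    def __init__(self, region_id, level):
        self.region_id = region_id
        self.level = level      # higher means more aggressive assumptions
        self.failures = 0       # recoveries seen by this translation


def run_region(translation, assumption_holds, max_failures=3):
    """Run one pass over a region; return (outcome, translation to use next)."""
    if assumption_holds:
        return "committed", translation

    # The assumption check failed: state is rolled back to the commit point.
    translation.failures += 1
    if translation.failures >= max_failures and translation.level > 0:
        # Recovery is happening too often, so retranslate less aggressively.
        less_aggressive = Translation(translation.region_id, translation.level - 1)
        return "retranslated", less_aggressive
    return "rolled_back", translation


t = Translation(region_id=1, level=2)
outcomes = []
for holds in [True, False, False, False, True]:
    outcome, t = run_region(t, holds)
    outcomes.append(outcome)
```

After the third failed check the region is replaced by a translation at level 1 with a fresh failure count, so occasional misspeculation is tolerated while persistent misspeculation is removed.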
1st Gear
Executes 1 instruction at a time
• Profiles code at runtime
• Gathers data for flow analysis
• Gathers branch frequencies and directions
• Detects load/store typing (IO vs memory)
Filters out infrequently executed code
No startup cost
Lowest speed
2nd Gear
Uses profile information to create initial translation
• Translates a “Region” of up to 100 x86 instructions
• Adds flow graph “Shape” information
• Light Optimization
• “Greedy” scheduling
Low translation overhead
Fast execution
3rd Gear
Further optimizes gear 2 regions
• Common sub-expression elimination
• Memory re-ordering
• Significant code optimization
• Critical path scheduling
Medium translation overhead
Faster execution
4th Gear
Most advanced optimizations for hottest regions
• Splices together multiple regions
• Optimizes across region boundaries
• Critical path scheduling
Highest translation overhead
Fastest execution
(From Sept 2004 Microprocessor Forum)
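The four gears amount to a tiered policy in which hotter code earns more expensive translation. The Python sketch below shows one way such a policy could be expressed. The execution-count thresholds are invented for illustration; the presentation does not state the real promotion criteria.

```python
# Hypothetical promotion thresholds, checked from the highest gear down.
GEAR_THRESHOLDS = [(4, 10_000), (3, 1_000), (2, 50)]


def select_gear(execution_count):
    """Return the gear a region would run in for a given execution count."""
    for gear, threshold in GEAR_THRESHOLDS:
        if execution_count >= threshold:
            return gear
    return 1  # 1st gear: interpret one instruction at a time while profiling


counts = {"cold_init": 3, "warm_loop": 200, "hot_loop": 5_000, "hottest": 50_000}
gears = {name: select_gear(n) for name, n in counts.items()}
```

Rarely executed code stays in the interpreter and never pays translation overhead, while only the hottest regions reach the costliest optimizations.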
• Interesting vision and implementation
• Fit for the time when it was launched
– 90 nm technology then; now we are at 22 nm and going lower
– Low power is today’s norm
– Multicore, more ISA functionality, security, privacy, FT…
• More recently…
• A real system evaluation of hardware atomicity for software speculation [ASPLOS 2010]
• BlockChop: Dynamic Squash Elimination for Hybrid Processor Architecture [ISCA 2012]
• Component Level Power Gating [Under Submission]
• Reducing Startup Time in Co-Designed Virtual Machines [ISCA 2006]
• LAR-CC: Large atomic regions with conditional commits [CGO 2011]
• A HW/SW co-designed heterogeneous multi-core virtual machine for energy-efficient general purpose computing [CGO 2011]
• …
A real system evaluation of hardware atomicity for software speculation [ASPLOS 2010]
• Dependence graph shown
• Atomic region trivially enables optimizer to eliminate operations
– 88 -> 73 Efficeon operations
– 17% reduction
• Relaxes scheduling constraints
– 26 -> 19 cycles
– 27% reduction
BlockChop: Dynamic Squash Elimination for Hybrid Processor Architecture [ISCA 2012]
• Multiple atomic blocks may access and modify the same data block
• Violates atomicity assumption
• Results in a squash
• Frequent squashes kill performance: atomicity hurts!
• Introduce BlockChop for shared data block conflicts
• BlockChop can leverage interpreter, translator, or region cache to respond to squashes
• Makes decisions on how to respond with a squash handling mechanism (SHM)
• Adaptive chopper performs remarkably well
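To make the idea of a squash handling mechanism concrete, the Python sketch below escalates its response as a region keeps getting squashed. It is purely illustrative: the escalation order, the limits, and the names `SquashHandler`, `retry_limit`, and `interpret_limit` are assumptions and should not be read as the policy evaluated in the BlockChop paper.

```python
from collections import defaultdict


class SquashHandler:
    """Hypothetical per-region policy for responding to squashes."""

    def __init__(self, retry_limit=2, interpret_limit=5):
        self.squashes = defaultdict(int)   # squash count per region
        self.retry_limit = retry_limit
        self.interpret_limit = interpret_limit

    def on_squash(self, region_id):
        """Record a squash and choose how to get past the conflict."""
        self.squashes[region_id] += 1
        n = self.squashes[region_id]
        if n <= self.retry_limit:
            return "retry"                 # treat the conflict as transient
        if n <= self.interpret_limit:
            return "interpret"             # step through with the interpreter
        return "retranslate_smaller"       # rebuild as smaller atomic blocks


handler = SquashHandler()
responses = [handler.on_squash("region_a") for _ in range(7)]
```

Cheap responses are tried first, and the expensive retranslation is reserved for regions whose conflicts persist.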
Component Level Power Gating [Under Submission]
Intervals of execution
• Resource usage (SIMD unit) is variable: power-gate opportunity
– An ideal predictor (red) can turn off VPU at fine granularity – but at what power cost of its own?
– A predictor sampling at coarse granularity (blue) is sub-optimal
– How to get to the red curve?
• Profile execution to determine when a component is used
• Insert explicit (speculative) instructions to turn off or on a component
– No need for a complex hardware predictor
– Can still achieve near optimal power savings
• Speculation failure leads to recovery and adaptive retranslation (i.e., remove the instruction)
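The profile-then-insert approach above can be sketched as follows. This minimal Python example assumes the profile is a per-interval list of booleans saying whether the component was used, and assumes a hypothetical break-even parameter `min_idle` below which gating is not worthwhile. The function name and the `gate_off`/`gate_on` labels are illustrative, not real instructions.

```python
def insert_gating_hints(usage, min_idle=3):
    """Return (interval_index, op) hints for idle runs of at least min_idle.

    usage: list of bools, True where the component is used in that interval.
    """
    hints = []
    i, n = 0, len(usage)
    while i < n:
        if usage[i]:
            i += 1
            continue
        # Find the end of this idle run.
        j = i
        while j < n and not usage[j]:
            j += 1
        if j - i >= min_idle:
            hints.append((i, "gate_off"))      # turn the component off
            if j < n:
                hints.append((j, "gate_on"))   # turn it back on before use
        i = j
    return hints


profile = [True, False, False, False, False, True, False, True, True]
hints = insert_gating_hints(profile)
```

The four-interval idle run is gated while the single idle interval is left alone, since switching for such a short gap would likely cost more than it saves. If the profile turns out to be wrong at runtime, the translator can recover and retranslate without the hint.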
A lot of interest in hybrid microprocessors
What microprocessor design challenges do these address?
• Performance per watt is important everywhere from smartphones and tablets to exascale computers
• Much of recent work on Hybrid architectures in this area
• Better design tradeoffs:
Promising but needs to be proven: a lot more work required
• Traditionally, microprocessors have not been built with security and privacy in mind
– No longer feasible to continue designing in that manner
– Rise of sophisticated cybercrime for commercial (criminal) profits and govt. spying
– Malware in every computing form factor: how long before it accesses the digital wallet on your phone?
• Hybrid architecture permits a middleware to diagnose, prevent and react to attacks
– A microarchitecture that can inspect code and sandbox data
– Flexibly update the protections
How much better than an OS or a VMM?
• The number of form factors for computing devices has grown
– Each with a different requirement and a microarchitecture
– Different design goals imply different tradeoffs
– Industry heavyweights would like to consolidate
– Also brings down prices
Can we create a malleable architecture to span all segments?
What are the requirements and challenges?
• Design cycle of 4-5 years now considered too long
– Esp. when there are vendors who roll out a chip every two years
– Possible with simpler baseline, separate arch/uarch/fabrication designs
Also sub-optimal; co-designed arch with uarch and with knowledge of fabrication technology is more efficient
Ends up being a slower design cycle due to the complexity
• Hybrid architecture permits continuing “microarchitectural” (software) design late into the development phase
– Overlaps development phases traditionally not possible
But what about co-designing hardware and software together?
• Intel x86 has to deal with enormous amount of legacy ISA features
– Even micro-architectural legacy is hard to overcome in traditional designs (too much complexity to easily revisit past solutions)
– But legacy is a part of life: demand for new features
– ARM 32-bit/64-bit
• Binary translation is a known mechanism to deal with this problem
– Apple Rosetta (PPC -> x86)
– Houdini on Intel phones (ARM -> x86)
Hybrid architecture as a problem solver for legacy ISA?
– Or does it replace one (micro-architectural) legacy with another (software)?
• A multi-year effort
– Development cycle is long
– Lack of abstractions leads to a lot of bugs being found during end-to-end validation
– Reuse of components does help
• Hybrid architecture separates out a large piece of complexity
– Hardware is simpler and easier to validate
– Software uses high-level language abstraction (it's not RTL)
– Parallelize validation of HW and SW components
Are we addressing the validation challenge or making it even bigger?
• Hybrid microarchitecture is a promising area of research
– Combines a simple hardware and complex software system
– Continuous dynamic compilation and adaptation
• Appears promising in addressing microprocessor design challenges
– A number of open questions
– Academia can help us answer those
• Fodder for future thought
– Is Hybrid micro-architecture better suited to solving problems in reliability, fault tolerance, determinism, and parallelism extraction?