Transmeta's Crusoe Architecture
Umran A. Khan
Microprocessors

Generations of Crusoe Processors
- Original architecture: TM3120, TM5400
- Later versions: TM5600-TM5800
- The architecture is largely the same, but improved:
  - Faster clock rate (now up to 800 MHz)
  - Smaller core (0.13 micron process)
  - Special instructions for the OS it is emulating
  - Lower power consumption
  - Wider range of applications (from internet appliances to high-density servers)
- We will look at the TM5400 here

Instruction Set
- Uses a VLIW (Very Long Instruction Word) instruction format/engine
- An instruction word is a packet up to 128 bits long
- Each word (also called a molecule) holds up to four individual operations called atoms
- Atoms are packed into either 128-bit or 64-bit molecules
- The atoms in a molecule execute in parallel (up to 4 operations per clock)
- These operations must be independent of one another

Four Types of Execution Units
- FPU (Floating Point Unit)
  - 10-stage floating-point pipeline
  - Uses the conventional x86 80-bit register format
  - 32 FP registers
- Two integer ALUs (Arithmetic-Logic Units)
  - 7-stage integer pipeline
  - 64 32-bit registers dedicated to them
- LSU (Load/Store Unit)
- Branch Unit

Sample Instruction
128-bit instruction (molecule), one atom per execution unit:
  Atom:  FADD | ADD            | LD               | BRCC
  Unit:  FPU  | Integer ALU #0 | LSU (Load/Store) | BU (Branch)
(Figure copied from reference #1)

Introduction to Code Morphing
- Code Morphing software is a clever translation layer that dynamically recompiles an x86 program into the native VLIW instruction format
- It is stored in the BIOS ROM and runs in main memory
- An entire group of instructions is translated at once and then placed in the translation cache
- Basically, an emulation mechanism
- The technique is not tied to x86 (compare FX!32 on Alpha); the TM3120 targets Linux devices, but the TM5400 is known for its x86 compatibility
- Great potential!
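The translate-once, reuse-later idea behind the translation cache can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names `TranslationCache`, `fetch`, and `toy_translate` are hypothetical, the cache is keyed by the block's entry address, and the "translation" merely tags each x86 instruction instead of emitting real VLIW molecules.

```python
from typing import Callable, Dict, List, Tuple


class TranslationCache:
    """Toy model: translate a block of guest (x86) instructions once, then reuse it."""

    def __init__(self, translate: Callable[[Tuple[str, ...]], List[str]]):
        self._translate = translate
        self._cache: Dict[int, List[str]] = {}
        self.misses = 0  # number of times the translator actually ran

    def fetch(self, entry_address: int, guest_block: Tuple[str, ...]) -> List[str]:
        # Translate the whole block only on the first visit to this address.
        if entry_address not in self._cache:
            self.misses += 1
            self._cache[entry_address] = self._translate(guest_block)
        return self._cache[entry_address]


def toy_translate(guest_block: Tuple[str, ...]) -> List[str]:
    # Placeholder translation (an assumption for illustration only):
    # a real translator would emit scheduled VLIW molecules.
    return [f"native<{insn}>" for insn in guest_block]


cache = TranslationCache(toy_translate)
block = ("addl %eax, (%esp)", "subl %ecx, 5")

first = cache.fetch(0x1000, block)   # translated and stored
second = cache.fetch(0x1000, block)  # served from the cache
print(cache.misses, first is second)
```

The second `fetch` does not call the translator again, which is the property the slide relies on when it says a group of instructions is translated at once and then kept in the translation cache.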
Crusoe Translation Layers
(Figure: the CPU core is wrapped by the Code Morphing layer, which presents an x86 interface to the x86 BIOS, the operating system, and x86 applications)

Traditional x86 Architecture
- IA-32 instructions are translated by the CPU into more compact, uniform RISC-like operations (each instruction is translated individually)
- The translation is elaborate and complicated
- It has dedicated hardware for:
  - x86 instruction translation
  - Branch prediction
  - Register renaming
  - Instruction reordering

Transmeta's Simplified Core
- A lot of the processor functionality is implemented in software
- Its hardware is made up of execution units, the instruction decode unit and, of course, the cache
- The rest of the dedicated hardware (on the previous slide) is replaced by software
- Advantages:
  - The CPU takes less die space
  - Lower power demand
  - Less expensive to produce and upgrade

Hardware vs. Software
- Implementing hardware functions in software comes at a cost
  - Software is slower than hardware
  - It is not easy to do
- But how much slower?
  - Reordering instructions, renaming registers, predicting branches on the fly, etc., on the same hardware that performs addition and instruction execution, adds complications
- Do the benefits outweigh the costs?
  - According to Transmeta, they do!

Execution, Decoding and Scheduling
- In x86, instructions are translated individually
- An instruction's binary is fetched and decoded into n operations
- These operations are reordered and fed to the execution units (e.g. FPU, ALU) in parallel
- The results of out-of-order execution have to be put back into program order, and instructions are retranslated each time they run (complicated and costly)

Execution, Decoding and Scheduling (Continued)
- In Crusoe, a group of instructions is translated at once
- Instructions are translated once and placed in the translation cache
- If the same code runs again, the processor fetches it from the translation cache
- The scheduler can reorder instructions by looking at the generated code
- Thus, the number of instructions executed can be minimized

Caching and Optimization
- The translation cache is used more efficiently
- A translation is optimized further each time it is executed; it will probably take more than one pass to be fully optimized
- Optimization is done in steps
- Sections of code that run only once usually are not optimized
- Code is recompiled quickly to keep the processor and the program running
- Uses the common optimizations of an ordinary compiler
- The optimizer is basically a simple compiler

Optimization Strategies
- The Code Morphing software has several ways to gather feedback about a running program
- "Instrumented translation": special code collects information about the block that is going to be executed; this information is later used for optimization and translation
- Branch prediction, path speculation and the reordering of loads and stores are done by the Code Morphing layer with some (alias) hardware support and some condition code
- Filtering: determines how much effort should be spent on translating and optimizing a piece of code
- Execution modes: interpretation, translation with or without optimization

Translation Example
x86 SOURCE
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5

FRONTEND
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

KEY
  addl: load and add      ld: load
  movl: load              add.c: add with condition codes set
  subl: load and sub      sub.c: sub with condition codes set

OPTIMIZER
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

SCHEDULER
  ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
(Example from reference #2)

Power Management
- Typical power-saving approaches:
  - Switching off the processor
  - Duty cycling, which causes glitches
  - Changing the clock rate by suspending to and restarting from RAM
- Crusoe power-saving approaches:
  - LongRun power management (next slide)
  - Integrates the north bridge of the chipset and the RAM controllers onto the CPU
  - Can also integrate video and sound
  - Saves power in the overall system

LongRun Power Management
- A feature of the Code Morphing software layer that detects CPU load
- Can adjust the clock frequency on the fly
- Can dynamically change the CPU voltage
- Lowering the clock rate by 10%, with the voltage lowered along with it, can reduce power consumption by roughly 30%: power scales with frequency x voltage^2, so 100% x (1 - 0.9 x 0.9^2) = 27.1%
- Fewer heat problems
- No need for extra fans, which take up more power and space

Conclusion
- Advantages
  - Low power consumption
  - Low cost
  - Longer battery life
  - Great for mobile users, embedded systems and even high-density servers
  - Smaller and lighter computers
- Code Morphing technology
  - Can, in principle, emulate any target architecture
  - Compatibility
  - Uses special optimization techniques for target operating systems
  - Easier software debugging (see reference #1)
  - Cheaper and simpler upgrades

Conclusion (Continued)
- Disadvantages
  - An emulation cannot be faster than the real thing
  - Code translation requires extra cycles
  - Code Morphing software runs in main memory and takes up memory bandwidth
  - Heavy coding
  - Inherits some of the same problems as other VLIW processors
    - Needs clever compilers to extract parallelism
    - Too much fixup code (for speculation, prediction, rollback, etc.)
  - The technology seems to be geared mainly toward mobile users
  - For desktops (power users) and servers, performance outweighs power consumption
  - Higher performance generally costs more power

Final Thoughts
- Transmeta reported net revenue of only $4.1 million for the first quarter of 2002
- No significant share of the mobile market
- Although Transmeta has clever technology, the clock speeds of AMD and Intel have overshadowed its impact, much as happened to Multiflow (their clock speeds are about 1.0 GHz faster than the Crusoe's)
- AMD and Intel have also developed their own power-efficient mobile processors (mobile Athlon XP with AMD PowerNow!™ technology and mobile Pentium 4 with Intel® SpeedStep® technology)

Stay Tuned for the Next Exciting Episode
"AMD, I am your father!" vs. "Not any more!!!"

References
1. http://www.hardwareanalysis.com/content/editorials/article/1237.4/
2. http://www.transmeta.com/pdf/white_papers/paper_aklaiber_19jan00.pdf
3. http://www.arstechnica.com/cpu/1q00/crusoe/crusoe-1.html
4. http://www.erc.msstate.edu/~reese/EE8063/html/transmeta/transmeta.pdf