An energy-efficient combined floating point and integer ALU for reconfigurable multi-core architectures
A literature study by Tom Bruintjes
27/09/10

01/10/10 Floating Point 1

Assignment
Design or modify a Floating Point Unit so that it can also be used as an Integer Unit, and determine its cost in terms of area and energy efficiency.
Requirements
- Floating Point addition and multiplication & Integer addition and multiplication
- Shallow pipeline (preferably no more than 2 stages)
- Low area cost
- Low power consumption
01/10/10 2

Motivation
Multi-core architecture
- MPSoC
- Tile processor
Heterogeneous, but no Floating Point
- Too expensive (area, energy)
- Fixed Point alternative
- Software emulation
Tilera TILE-Gx100 (100 cores, but no floating point)
01/10/10 3

Motivation (2)
What if we did add an FPU?
- High-performance FP ops
- A lot of hardware needed
- Complex datapath → high latency (low frequency) → deep pipeline
- A lot of area wasted while FP is idle
01/10/10 4

Motivation (3)
Idea: add an FP core and make it compatible with Integer operation, so that Integer ops can be offloaded to the FP core when it is idle.
The shared core should be deployable in an embedded system (MPSoC), hence the low area and power consumption requirements. Few pipeline stages keep the compiler manageable.
01/10/10 5

Floating Point – History
The need for FP was recognized early.
The first FPU: Konrad Zuse's Z1 (1938)
- 22-bit floating-point numbers
- Storage for 64 numbers
- Memory of sliding metal parts
01/10/10 6

Floating Point – History (2)
In the beginning, floating-point hardware was typically an optional feature
- "scientific computers"
- Extremely expensive
Then FP became available in the form of ("math") co-processors
- Intel x87 (486 vs )
- Weitek
Mid 90s: most GPPs are equipped with FP units
Current situation: FP also in small processors
01/10/10 7

Why Floating Point
Unsigned/Signed: …,-2,-1,0,1,2,3,… [0000, 0001, 0010, 0011]
- What about rational numbers, or very large/small numbers?
Fixed Point: 0.11, 1.22, 2.33, … [00.11, 01.10, 10.11]
Limited range and precision
- Solution: Floating Point (scientific) notation
- 1.220 × 10^5 (or 12.20 × 10^4 or 122.0 × 10^3, hence "floating" point)
01/10/10 8

Floating Point representation/terminology
Floating Point representation, e.g., 6.02 × 10^23:
- Sign S
- Significand M (not "mantissa"!)
- Exponent E (biased)
- Base (radix, implicit)
Binary representation: [1 | 00001111 | 10101010101010101010101]
01/10/10

Binary Floating Point storage (issues)
Normalization
- Prevents redundancy: 0.122 × 10^5 vs 1.22 × 10^4
- Normalization means the first digit is never zero
- For binary numbers this means the MSB is always 1 → "hidden bit"
Single, Double or Quad precision
- 32 bits: single (23-bit significand & 8-bit exponent)
- 64 bits: double (52-bit significand & 11-bit exponent)
Base is implicit
- 2, 10 or 16 are common
Special cases? (NaN, 0, ∞)
01/10/10

The road to getting standardized
Many ways to represent a FP number
- Significand (sign-magnitude, two's complement, BCD)
- Exponent (biased, two's complement)
- Special numbers
Unorganized start
- Every company used its own format
- IBM, DEC, Cray
Highly incompatible
- 2 × 1.0 on machine A gives a different result than on machine B
- Situation even worse for exceptions (e.g., underflow and overflow)
01/10/10 11

IBM System/360 & Cray-1
IBM highlights
- Sign magnitude & biased exponent
- Base-16 numeral system (more efficient/less accurate)
Cray-1 highlights
- Sign magnitude & biased exponent
- Very high precision (64-bit single precision)
01/10/10 12

IEEE-754
Standardized FP since 1985 (updated in 2008)
Arithmetic formats
- Binary and decimal Floating Point data (+ special cases)
Operations
- Arithmetic and other operations applied to arithmetic formats
Rounding algorithms
- Rounding routines for arithmetic and conversion
Exception handling
- Exceptional conditions
Format (binary or decimal)
- Sign-magnitude significand & biased exponent
- Base-2 or base-10
- N =
(-1)^S * (1.M) * 2^(E-127)
01/10/10 13

IEEE-754 (2)
Operations
- Minimum set: Add, Sub, Mul, Div, Rem, Rnd to Int, Comp
- Recommended set: Log, …
Rounding modes
- Round to nearest, ties to even
- Round up
- Round to zero
- Round down
Exceptions
- Invalid operation
- Division by zero
- Overflow
- Underflow
- Inexact
01/10/10 14

Rounding
Almost never an exact FP representation:
[1.11110] × 2^5 (62d) vs [1.11111] × 2^5 (63d)
Rounding is required
IEEE-754 rounding modes:
- Round to nearest (ties to even)
- Round to zero
- Round up
- Round down
Rounding (to nearest) algorithm based on 3 LSBs (guard bits): 0-- (down) | 100 (even) | 1-- (up)
01/10/10 15

Floating Point arithmetic
More complex than Integer
Lots of shifting of results, and overhead due to exceptional cases
Addition: 2.01 × 10^12 + 1.33 × 10^11
1. Check for zeros.
2. Align significands so exponents match (guard bits): right-shift!
3. Add/subtract significands.
4. Normalize and round the result.
01/10/10 16

Floating Point addition
1. Check for zeros.
2. Align significands so exponents match.
3. Add/subtract significands.
4. Normalize and round the result.
01/10/10 17

Floating Point Arithmetic (2)
Multiplication
1. Check for zeros.
2. Multiply significands.
3. Add exponents (correct for double bias).
4. Normalize & round the result.
Division
1. Check for zeros.
2. Divide significands.
3. Subtract exponents (correct for double bias).
4. Normalize & round the result.
01/10/10 18

Floating Point Architecture
Architecture is a combination of HW, SW, format, exceptions, …
Focus on the hardware (datapath) of a Floating Point Unit
- Multiplier
- Adder/Subtractor
(- Divider)
- Shifters
- Comparators
- Leading Zero Detection
- Incrementers
How are the components connected, which techniques are used, and how does that influence the efficiency of the FPU?
- Latency (parallelism)
- Throughput (ILP, pipeline stages)
- Area & power (clock gating)
01/10/10 19

Highlighted Architectures
UltraSparc T2
Itanium
Cell
01/10/10 20

UltraSparc T2
The UltraSparc T2 was released in 2007 by Sun
Features
- Multicore (since 2008 SMP-capable) microprocessor
- Eight cores, 8 threads each = 64 concurrent threads
- Up to 1.6 GHz
- Two Integer ALUs per core
- One FPU per core
- "Open" design
Applications
- Only servers produced by Sun
01/10/10 21

UltraSparc T2 Floating Point
Eight cores, each with an FPU
- Single and double precision IEEE
Conventional FPU design
- Dedicated datapath for each instruction
UltraSparc characteristics
- Pipeline for addition/multiplication: 6 stages, 1 instruction per cycle → shared
- Combinatorial division datapath
- Area- and power-efficient: clock gating reduces switching
01/10/10 22

Itanium
Intel and HP combined efforts to revolutionize computer architecture in '98
- Complete overhaul of the legacy x86 architecture, based on instruction-level parallelism
- RISC replaced by VLIW
- Large registers
First Itanium appeared in 2001; the latest model (Tukwila) is from February 2010
Tukwila features
- 2–4 cores per CPU
- Up to 1.73 GHz
- Four Integer ALUs per core
- Two FPUs per core
01/10/10 23

Itanium
Very powerful, very big
- Two full IEEE double-precision FP units
- Leader in SPECfp
- Single and double precision + custom formats
Architecture
- Unfortunately, (too) many details are undisclosed
- So why look at Itanium at all? Because what has been disclosed is interesting: Fused Multiply-Add
01/10/10 24

Fused Multiply-Add
The FMA architecture fuses multiply and add instructions: (A*C)+B vs A*C and A+B
FMA advantages
- Atomic MAC operations (~double performance)
- Only one rounding error
Expensive?
- Multiplication: Wallace tree of CSAs
- Partial product addition: 3:2 CSA
- Full adder for conversion from carry-save format
- Leading Zero Detection/Anticipation
- Shifters for alignment and post-normalization
No: end-around-carry principle
01/10/10 25

End-around carry multiplication
[Figure: carry-save adder vs full adder; CSA chain vs CSA tree] → add one more CSA before conversion
01/10/10 26

Fused Multiply-Add (2)
FP ops based on the Fused Multiply-Add architecture
FMA: fma.[pc].[sf] f1 = f3, f4, f2 → f1 = (f3 * f4) + f2
ADD: fadd.[pc].[sf] f1 = f3, f2 (uses f1 = +1.0 as multiplier)
MUL: fmul.[pc].[sf] f1 = f3, f4 (uses f0 = +0.0 as addend)
f0 hardwired to +0.0; f1 hardwired to +1.0
- Not as efficient as dedicated add and multiply instructions
Division and Square Root
- Division and Square Root can be implemented in software
- Lookup table for initial estimate (1/a and 1/√a)
- Newton-Raphson approximation (1 initial approximation and 13 FMA instructions on the Itanium)
- Intel FPU bug! ($475,000,000)
01/10/10 27

Cell
Combined effort of Sony, Toshiba and IBM
- Sony: architecture & applications
- IBM: SOI process technology
- Toshiba: manufacturing
- Development started in 2000; 400 people, $400M
- First Cell in 2006
Applications
- PlayStation 3
- Blu-ray
- HDTVs
- High-performance computing
Features
- 9 cores (PPC and SPEs) for Integer and FP
- 3.2 GHz
- All-SIMD instruction set
01/10/10 28

Cell (2)
1 PPC and 8 SPEs
- PPC for compatibility
- SPEs for performance
1 FPU per SPE
- 4 single-precision cores per FPU
- 1 double-precision core per FPU
Why separate?
- Performance requirements for single-precision FP are too high for a double-precision unit
01/10/10 29

Single Precision FP in the Cell
Single precision
- Full FMA unit
- Similar approach to the Itanium
- DIV/SQRT/convert/… in software
Aggressive optimization
- Denormal numbers forced to zero
- NaN/∞ treated as normal numbers
- Only round to zero
01/10/10 30

Shared Integer/FP ALUs
Have FPUs been used for Integer operations in the past?
- Yes, in fact the UltraSparc T2 and Cell already do so
- Cell: converts Integers into a format that can be processed by the single-precision FPU
- UltraSparc: maps Integer multiplication, addition and division directly onto the respective FP hardware, though without the full MAC capabilities…
Issues
- Overhead due to FP-specific hardware
- Priorities
- Starvation
01/10/10 31

Approach
Design FPU
- Implement a single-precision core and drop most of the features that make FP so expensive, much like the Cell processor
- Widen the design to make it compatible with 32-bit Integer operands
Add Integer capability
- Add switches and control logic to the design to support Integer operands
- …without affecting FP performance
Optimization
- Optimize the design for efficiency (area/power)
Measure performance, area and power consumption
- 65 or 90 nm
01/10/10 32

Approach – Floating Point Unit
Formatting
- Close to IEEE format (not a GPP, but don't make it too obscure either, e.g., Itanium)
- Sign magnitude
- Biased exponent
- Base-2
- Single precision (double is excessive)
- Initially ignore special cases
Architecture
- Fused Multiply-Add unit only, plus compares; à la Cell: shifter, tree multiplier, CSA, full adder
- Initially three pipeline stages: 1) align/multiply, 2) add/prepare normalization, 3) post-normalize
- Reduce to two stages if possible
01/10/10 33

Approach – Floating Point Unit (2)
IEEE-754 compatibility
- Format (not all the special cases)
- Arithmetic (next slide)
- Rounding modes: round to zero, round to nearest, round up, round down
Exceptions and special cases
- Denormalized numbers
- NaN, Infinity (to be determined)
- Exceptions (underflow, overflow, etc.)
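As a software model of the default IEEE-754 mode among the rounding modes listed above (round to nearest, ties to even, driven by the guard bits from the rounding slide), the step that drops the extra low bits of an aligned significand can be sketched as follows. The function name and interface are illustrative, not part of the planned hardware.

```python
def round_nearest_even(significand: int, extra_bits: int) -> int:
    """Drop the `extra_bits` low bits of an integer significand,
    rounding to nearest and breaking ties toward an even result
    (the IEEE-754 default rounding mode)."""
    if extra_bits == 0:
        return significand
    kept = significand >> extra_bits
    discarded = significand & ((1 << extra_bits) - 1)
    half = 1 << (extra_bits - 1)  # exactly halfway point
    # Round up when above halfway, or on a tie when `kept` is odd.
    if discarded > half or (discarded == half and kept & 1):
        kept += 1
    return kept

# 0b101.10 is exactly halfway between 5 and 6; the even choice is 6.
assert round_nearest_even(0b10110, 2) == 6
# 0b100.10 is halfway between 4 and 5; the even choice is 4.
assert round_nearest_even(0b10010, 2) == 4
```

This mirrors the 3-LSB decision table on the rounding slide: below halfway rounds down, above halfway rounds up, and an exact tie goes to the even neighbour.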
01/10/10 34

Approach – Floating Point Unit (3)
FP arithmetic
- Multiplication → Fused Multiply-Add
- Addition → Fused Multiply-Add
- Division → Software
- Square Root → Software
- Conversion → Software
- Compare
01/10/10 35

Approach – Integer Unit
32-bit signed Integer ALU
- Preferably two's complement (most common representation)
- Single precision maps nicely onto 2×32-bit registers
Arithmetic mapping
- Addition → Full adder
- Multiplication → Wallace tree
- MAC
- Shift → Aligner
Reconfiguring
- Initially no bypassing (drain the pipeline before reconfiguring)
01/10/10 36

Proposed architecture
32-bit input registers
- FP: 32-bit significand & 32-bit exponent
- Integer: 32-bit signed
3-stage pipeline
- Stage 1: Aligner for FP or barrel shifter, 32x32 multiplier
- Stage 2: Full adder and Leading Zero Detection
- Stage 3: Normalization and rounding
2-stage pipeline?
- Merge stages 2 and 3
01/10/10 37

Testing/Benchmarking
After functional testing, implementation in 65 or 90 nm
Measure area and power usage
- Benchmark to be determined
01/10/10 38

Questions
Whatever the question, lead is the answer.
01/10/10 39
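The proposed architecture implements division in software, and the Itanium slides describe how: a table lookup gives an initial estimate of 1/a, which Newton-Raphson refinement turns into a full-precision reciprocal using only FMA instructions. A minimal sketch of that refinement loop, with Python's floating point standing in for the hardware FMA and an illustrative function name and iteration count:

```python
def reciprocal(a: float, x0: float, iterations: int = 4) -> float:
    """Refine an initial estimate x0 of 1/a with Newton-Raphson.

    Each iteration consists of two FMA-shaped operations, which is
    why a short FMA sequence suffices to implement division."""
    x = x0
    for _ in range(iterations):
        e = 1.0 - a * x   # FMA shape: e = -(a * x) + 1.0
        x = x + x * e     # FMA shape: x = (x * e) + x
    return x

# The error roughly squares each iteration (quadratic convergence),
# so a crude table estimate converges in a handful of steps.
approx = reciprocal(3.0, 0.3)
assert abs(approx - 1.0 / 3.0) < 1e-12
```

A full software divide would then compute b/a as b times this reciprocal, with a final FMA-based correction step to get correct rounding; that last step is what the slides' 13-instruction Itanium sequence spends its extra operations on.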