Floating Point

An energy-efficient combined floating point and integer
ALU for reconfigurable multi-core architectures
A literature study by
Tom Bruintjes
27/09/10
01/10/10
Floating Point Unit
Assignment
 Design or modify a Floating Point Unit so that it can also be used as an Integer
Unit, and determine its cost in terms of Area and Energy efficiency.
 Requirements
- Floating Point addition and multiplication & Integer addition and multiplication
- Pipeline should be shallow (preferably no more than 2 stages)
- Low area costs
- Low power consumption
Motivation
 Multicore architecture
- MPSoC
- Tile Processor
 Heterogeneous but no Floating Point
- Too expensive (area, energy)
- Fixed Point alternative
- Software Emulation
Tilera TILE-Gx100
(100 cores but no floating point)
Motivation (2)
 What if we did add an FPU?
- High performance FP ops
- A lot of hardware needed
- Complex datapath → High latency (low frequency)
→ Deep pipeline
- A lot of area wasted if FP is idle
Motivation (3)
 Idea: Add FP core and make it compatible with Integer operation so that
Integer ops can be offloaded to the FP core when it is idle.
 The shared core should be deployable in an embedded system (MPSoC),
hence the low area and power consumption requirements.
 Few pipeline stages to keep the compiler manageable.
Floating Point - History
 Need for FP recognized early
 The First FPU:
Konrad Zuse’s Z1 (1938)
- 22-bit floating-point numbers
- storage of 64 numbers
- sliding metal parts memory
Floating Point – History (2)
 In the beginning floating-point hardware was typically an optional feature
- “scientific computers”
- extremely expensive
 Then FP became available in the form of (“math”) Co-processors
- Intel x87 (486 vs )
- Weitek
 Mid-90s: most GPPs are equipped with FP units
 Current situation: FP also in small processors
Why Floating Point
 Unsigned/Signed
(…,-2,-1),0,1,2,3,…
[0000,0001,0010,0011]
- what about rational numbers or very large/small numbers ?
 Fixed Point
0.11, 1.22, 2.33,…
[00.11, 01.10, 10.11]
 Limited range and precision
- Solution: Floating Point (scientific) notation
- 1.220 x 10^5
(12.20 x 10^4 or 122.0 x 10^3, hence floating point)
Floating Point representation/terminology
 Floating Point representation
6.02 * 10^23 (significand (mantissa): 6.02, base (radix): 10, exponent: 23)
- Sign S
- Significand M (not Mantissa!)
- Exponent E (biased)
- Base (implicit)
 Binary representation
[1 | 00001111 | 10101010101010101010101]
Binary Floating Point storage (issues)
 Normalization
- Prevent redundancy: 0.122 * 10^5 vs 1.22 * 10^4
- Normalization means that the first digit is never a zero
- For binary numbers this means the MSB is always 1 → “hidden bit”
 Single, Double or Quad precision
- 32 bits: single (23-bit significand & 8 bit exponent)
- 64 bits: double (52-bit significand & 11-bit exponent)
 Base is implicit
- 2, 10 or 16 are common
 Special cases? (NaN, 0, ∞)
The road to getting standardized
 Many ways to represent a FP number
- Significand (sign-magnitude, two’s complement, BCD)
- Exponent (biased, two’s complement)
- Special numbers
 Unorganized start
- Every company used their own format
- IBM, DEC, Cray
 Highly incompatible
- 2 * 1.0 on machine A gives a different result than on machine B
- Situation even worse for exceptions (e.g., underflow and overflow)
IBM System/360 & Cray-1
 IBM highlights
- Sign magnitude & biased exponent
- Base-16 numeral system (more efficient/less accurate)
 Cray-1 highlights
- Sign magnitude & biased exponent
- Very high precision (64-bit single precision)
IEEE-754
 Standardized FP since 1985 (updated in 2008)
Arithmetic formats - binary and decimal Floating Point data (+special cases)
Operations - arithmetic and operations applied to arithmetic formats
Rounding algorithms - rounding routines for arithmetic and conversion
Exception handling - exceptional conditions
 Format (binary or decimal)
- Sign-magnitude significand & biased exponent
- base-2 or base-10
- N = (-1)^S * (1.M) * 2^(E-127) (binary, single precision)
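The single-precision format can be illustrated with a small Python sketch that splits a 32-bit pattern into S, E and M and evaluates the formula for a normalized number. The helper names `split_fields`, `decode_normal` and `float_to_bits` are made up for this example, and special cases (zero, denormals, NaN, infinity) are deliberately ignored.

```python
import struct

def split_fields(bits):
    """Split a 32-bit pattern into (sign, biased exponent, fraction)."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def decode_normal(bits):
    """Evaluate (-1)^S * (1.M) * 2^(E-127) for a normalized number."""
    sign, exponent, fraction = split_fields(bits)
    significand = 1.0 + fraction / 2**23      # restore the hidden bit
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)

def float_to_bits(x):
    """Bit pattern of x after conversion to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(split_fields(0x3F800000))               # fields of 1.0
print(decode_normal(float_to_bits(-6.5)))
```

Comparing `decode_normal` against values packed by `struct` is a quick way to check that the field boundaries and the bias are right.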
IEEE-754 (2)
 Operations
- Minimum set: Add, Sub, Mul, Div, Rem, Rnd to Int, Comp
- Recommended set: Log,…
 Rounding modes
- Round to nearest, ties to even
- Round Up
- Round to zero
- Round down
 Exceptions
- Invalid operation
- Division by zero
- Overflow
- Underflow
- Inexact
Rounding
 An exact FP representation is almost never available
[1.11110] * 2^5 (62d)
[1.11111] * 2^5 (63d)
 Rounding is required
 IEEE-754 rounding modes:
- Round to nearest (ties to even)
- Round to zero
- Round up
- Round down
 Rounding (to nearest) algorithm based on 3 LSBs (guard bits)
0-- (down)
|
100 (even)
|
1-- (up)
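The round-to-nearest decision on the three extra bits can be sketched as follows. This is a minimal illustration, assuming an unsigned integer significand whose lowest `extra_bits` bits are the guard bits; the function name `round_nearest_even` is hypothetical.

```python
def round_nearest_even(value, extra_bits=3):
    """Drop the low `extra_bits` bits of an unsigned integer significand,
    rounding to nearest with ties to even."""
    kept = value >> extra_bits
    dropped = value & ((1 << extra_bits) - 1)
    half = 1 << (extra_bits - 1)
    if dropped > half:          # above the halfway point: round up
        kept += 1
    elif dropped == half:       # exactly 100: tie, round to the even neighbour
        kept += kept & 1
    return kept                 # below the halfway point: round down

print(bin(round_nearest_even(0b1011_011)))
print(bin(round_nearest_even(0b1011_100)))
print(bin(round_nearest_even(0b1010_100)))
```

Rounding up can overflow the significand (for example 1111 followed by 100), so a real datapath has to renormalize after this step.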
Floating Point arithmetic
 More complex than Integer
 Lots of shifting results and overhead due to exceptional cases
 Addition
2.01 * 10^12 + 1.33 * 10^11
1. Check for zeros.
2. Align significands so exponents match (guard bits): right shift!
3. Add/Subtract significands.
4. Normalize and Round the result
Floating Point addition
1. Check for zeros.
2. Align significands so exponents match
3. Add/Subtract significands.
4. Normalize and Round the result
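The four steps can be sketched in Python on a toy format. Everything here is an assumption made for illustration: operands are non-negative, a number is a pair (significand, exponent) with value significand * 2^exponent, the significand has P = 8 bits, and three guard bits are kept without a sticky bit, so results are not guaranteed to match correctly rounded IEEE arithmetic.

```python
P = 8        # significand width in bits (toy value, not IEEE)
GUARD = 3    # extra low-order bits kept during alignment

def fp_add(a, b):
    """Add two non-negative toy floats given as (significand, exponent).

    The significand is a P-bit integer with its top bit set (normalized);
    (0, 0) represents zero."""
    # 1. Check for zeros.
    if a[0] == 0:
        return b
    if b[0] == 0:
        return a
    # 2. Align: right-shift the operand with the smaller exponent.
    (sa, ea), (sb, eb) = (a, b) if a[1] >= b[1] else (b, a)
    sa <<= GUARD
    sb = (sb << GUARD) >> (ea - eb)   # bits shifted past the guard bits are lost
    # 3. Add significands.
    total = sa + sb
    exp = ea
    # 4. Normalize: same-sign addition produces at most one carry-out bit.
    shift = GUARD
    if total >> (P + GUARD):
        shift += 1
        exp += 1
    # Round to nearest, ties to even.
    kept = total >> shift
    dropped = total & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if dropped > half or (dropped == half and kept & 1):
        kept += 1
        if kept >> P:                 # rounding overflowed: renormalize
            kept >>= 1
            exp += 1
    return kept, exp

print(fp_add((0b11000000, 2), (0b10000000, 0)))   # 768 + 128
```

Subtraction and signed operands would additionally need leading-zero detection for the normalization step, which is left out here.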
Floating Point Arithmetic (2)
 Multiplication
1. Checking for zeros.
2. Multiplying significands
3. Adding exponents (correct for double bias)
4. Normalizing & Rounding the result
 Division
1. Checking for zeros.
2. Divide significands
3. Subtract exponents (correct for double bias)
4. Normalizing & Rounding the result
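The multiplication steps can be sketched the same way. This is an illustrative model only: operands are (sign, biased exponent, significand) triples with a 24-bit significand including the hidden bit, rounding is plain truncation (round to zero), and overflow, underflow and special values are not handled. The name `fp_mul` is hypothetical.

```python
BIAS = 127   # single-precision exponent bias
P = 24       # significand width including the hidden bit

def fp_mul(a, b):
    """Multiply two toy floats given as (sign, biased_exponent, significand).

    The significand is a P-bit integer with its top (hidden) bit set;
    a significand of 0 represents zero."""
    sign = a[0] ^ b[0]
    # 1. Check for zeros.
    if a[2] == 0 or b[2] == 0:
        return sign, 0, 0
    # 2. Multiply significands: the product has 2P-1 or 2P bits.
    product = a[2] * b[2]
    # 3. Add exponents; both carry the bias, so subtract it once.
    exponent = a[1] + b[1] - BIAS
    # 4. Normalize: the product of two values in [1, 2) lies in [1, 4).
    if product >> (2 * P - 1):
        product >>= 1
        exponent += 1
    # Round to zero by dropping the low P-1 bits.
    significand = product >> (P - 1)
    return sign, exponent, significand

print(fp_mul((0, 127, 0xC00000), (0, 127, 0xC00000)))   # 1.5 * 1.5
```

Division follows the same pattern with the exponents subtracted and the bias added back once.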
Floating Point Architecture
 Architecture is a combination of HW, SW, Format, Exceptions, …
 Focus on hardware (datapath) of a Floating Point Unit
- Multiplier
- Adder/Subtractor
(- Divider)
- Shifters
- Comparators
- Leading Zero Detection
- Incrementers
 How are components connected, what techniques are used and how
does that influence the efficiency of the FPU?
- Latency (parallelism)
- Throughput (ILP, pipeline stages)
- Area & Power (clock gating)
Highlighted Architectures
 UltraSparc T2
 Itanium
 Cell
UltraSparc T2
 UltraSparc T2 was released in 2007 by Sun
 Features
- Multicore (since 2008 SMP capable) microprocessor
- Eight cores x 8 threads per core = 64 concurrent threads
- Up to 1.6GHz
- Two Integer ALUs per core
- One FPU per core
- “Open” design
 Applications
- Only servers produced by Sun
UltraSparc T2 Floating Point
 Eight cores, each with an FPU
- Single and Double precision IEEE
 Conventional FPU design
- Dedicated datapath for each instruction
 UltraSparc characteristics
- Pipeline for addition/multiplication
6 stages, 1 instruction per cycle → shared
- Combinatorial division datapath
- Area and power efficient
clock gating
reduced switching
Itanium
 Intel and HP combined efforts to revolutionize computer architecture in ‘98
- Complete overhaul of the legacy x86 architecture based on instruction level parallelism
- RISC replaced by VLIW
- Large registers
 First Itanium appeared in 2001, the latest model (Tukwila) is from February 2010
 Tukwila features
- 2-4 Cores per CPU
- Up to 1.73GHz
- Four Integer ALUs per core
- Two FPUs per core
Itanium (2)
 Very powerful, very big
- Two full IEEE double precision FP units
- Leader in SPECfp
- Single and double precision + custom formats
 Architecture
- Unfortunately (too) many details are undisclosed
- So why look at Itanium at all? Because what has been disclosed is interesting:
Fused Multiply-Add
Fused Multiply Add
 The FMA architecture fuses the multiply and add instructions
(A*C)+B vs A*C and A+B
 FMA advantages
- Atomic MAC operations (~double performance)
- Only one rounding error
 Expensive?
- Multiplication: Wallace Tree of CSAs
- Partial product addition: 3:2 CSA
- Full adder for conversion from CS format
- Leading Zero Detection/Anticipation
- Shifters for alignment and Postnormalization
No: end-around-carry principle
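The "only one rounding error" advantage can be made concrete with a small Python emulation. This is a sketch under assumptions: single-precision rounding is emulated by packing through `struct`, the double-precision intermediate happens to be exact for the operands chosen, and the names `to_f32`, `mul_then_add` and `fused_mul_add` are invented for the example. It does not model how FMA hardware works internally.

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mul_then_add(a, c, b):
    """Separate multiply and add: the product is rounded before the add."""
    return to_f32(to_f32(a * c) + b)

def fused_mul_add(a, c, b):
    """Emulated fused multiply-add: only the final result is rounded.

    The double-precision intermediate is exact for the operands below."""
    return to_f32(a * c + b)

a = c = 1.0 + 2.0**-12
b = -(1.0 + 2.0**-11)
print(mul_then_add(a, c, b))
print(fused_mul_add(a, c, b))
```

With these operands the exact product is 1 + 2^-11 + 2^-24. Rounding it to single precision discards the 2^-24 term, so the separate version cancels to zero while the fused version keeps the small remainder.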
End-around carry multiplication
 Carry-save adder vs Full adder
 CSA chain
 CSA tree
 Add one more CSA before conversion
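As a rough illustration of carry-save addition, the Python sketch below reduces the partial products of an unsigned multiplication with 3:2 carry-save adders and performs a single carry-propagating addition at the end. It shows plain CSA reduction only, done sequentially rather than as a tree, and does not model the end-around-carry scheme itself. The function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 carry-save adder: reduce three integers to a (sum, carry) pair
    without propagating carries between bit positions."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry

def multiply_with_csa(x, y, width=8):
    """Multiply two unsigned `width`-bit integers by reducing the partial
    products with carry-save adders, then doing one carry-propagate add."""
    partial_products = [(x << i) if (y >> i) & 1 else 0 for i in range(width)]
    # Reduce three operands to two until only two remain (sequentially
    # here; a Wallace tree performs these reductions in parallel).
    while len(partial_products) > 2:
        a, b, c = partial_products[:3]
        partial_products = list(carry_save_add(a, b, c)) + partial_products[3:]
    # Final conversion out of carry-save form with a normal adder.
    return sum(partial_products)

print(carry_save_add(5, 3, 6))
print(multiply_with_csa(13, 11))
```

Because the (sum, carry) pair always adds up to the three inputs, an extra addend such as the B operand of an FMA can be folded in with one more CSA before the final conversion.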
Fused Multiply-Add (2)
 FP ops based on Fused Multiply-Add architecture
FMA: fma.[pc].[sf] f1 = f3, f4, f2
ADD: fadd.[pc].[sf] f1 = f3, (f1), f2
MUL: fmul.[pc].[sf] f1 = f3, f4, (f0)
f1 = (f3 * f4) + f2
register f0 hardwired to +0.0
register f1 hardwired to +1.0
- Not as efficient as single add and multiply instructions
 Division and Square Root
- Division and Square Root can be implemented in Software
- Lookup table for initial estimate (1/a and 1/√a)
- Newton-Raphson approximation
(1 approximation and 13 FMA instructions on the Itanium)
- Intel FPU bug! ($475.000.000)
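A software division built from multiply-add steps can be sketched as below. This is not the Itanium sequence: the linear initial estimate stands in for the lookup table, the divisor is assumed to lie in [0.5, 1), and the iteration count and the names `reciprocal` and `divide` are choices made for this example.

```python
def reciprocal(a, iterations=5):
    """Approximate 1/a for a in [0.5, 1) with Newton-Raphson.

    Each iteration uses only multiply-add style operations."""
    # Linear initial estimate standing in for the lookup table (assumption).
    x = 48.0 / 17.0 - 32.0 / 17.0 * a
    for _ in range(iterations):
        error = 1.0 - a * x       # one multiply-add
        x = x + x * error         # another multiply-add
    return x

def divide(n, d):
    """n / d for d in [0.5, 1), computed as a multiply by the reciprocal."""
    return n * reciprocal(d)

print(reciprocal(0.75))
print(divide(3.0, 0.6))
```

The error roughly squares with every iteration, which is why a small table plus a handful of FMA instructions is enough to reach full precision.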
Cell
 Combined efforts from Sony, Toshiba and IBM
- Sony: Architecture & Applications
- IBM: SOI process technology
- Toshiba: Manufacturing
- Development started 2000, 400 people, $400M
- First Cell in 2006
 Applications
- Playstation 3
- Blu-ray
- HDTVs
- High performance computing
 Features
- 9 cores (PPC and SPE) for Integer and FP
- 3.2GHz
- All SIMD instructions
Cell (2)
 1 PPC and 8 SPEs
- PPC for compatibility
- SPEs for performance
 1 FPU per SPE
- 4 single precision cores per FPU
- 1 double precision core per FPU
 Why separate?
- Performance requirements for SP Float
too high for a double precision unit
Single Precision FP in the Cell
 Single precision
- Full FMA unit
- Similar approach as Itanium
- DIV/SQRT/Convert/… in software
 Aggressive optimization
- Denormal numbers forced to zero
- NaN/∞ treated as normal numbers
- Only round to zero
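Two of these simplifications are easy to picture in code. The sketch below is an illustration on single-precision bit patterns, not a description of the actual Cell hardware; `flush_denormal` and `round_to_zero` are hypothetical helper names.

```python
def flush_denormal(bits):
    """Replace a single-precision denormal (exponent field 0) by a zero
    of the same sign; other bit patterns pass through unchanged."""
    if (bits >> 23) & 0xFF == 0:
        return bits & 0x80000000
    return bits

def round_to_zero(significand, extra_bits):
    """Round to zero: simply drop the extra low-order bits."""
    return significand >> extra_bits

print(hex(flush_denormal(0x00000001)))
print(bin(round_to_zero(0b1011_111, 3)))
```

Both operations need no adder or comparison against a halfway point, which is where the hardware saving comes from.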
Shared Integer/FP ALUs
 Have FPUs been used for Integer operations in the past?
- Yes, in fact the UltraSparc T2 and Cell already do so
- Cell: converts Integers into some format that can be processed by the SPfpu
- UltraSparc: maps Integer multiplication, addition and division directly onto the respective FP hardware, however not the full MAC capabilities…
 Issues
- Overhead due to FP specific hardware
- Priorities
- Starvation
Approach
 Design FPU
- Implement single precision core and drop most of the stuff that makes FP so expensive, much like the Cell processor
- Widen the design to make it compatible with 32-bit Integer operands
 Add integer capability
- Add switches and control in the design to support Integer operands
- …without affecting FP performance
 Optimization
- Optimize the design for efficiency
- Area/Power
 Measure Performance, Area and Power Consumption
- 65 or 90nm
Approach – Floating Point Unit
 Formatting
- Close to IEEE format (Not GPP but don’t make it too obscure, e.g. Itanium)
- Sign magnitude
- Biased exponent
- Base-2
- Single Precision (double is excessive)
- Initially ignore special cases
 Architecture
- Fused-Multiply-Add unit only + compares
A la Cell: Shifter, Tree Multiplier, CSA, Full adder
- Initially three pipeline stages:
1) Align/Multiply
2) Add/Prepare normalization
3) Post-normalize
- Reduce to two stages if possible
Approach – Floating Point Unit (2)
 IEEE-754 compatibility
- Format (not all the special cases)
- Arithmetic (next slide)
- Rounding modes
- Round to zero
- Round to nearest
- Round up
- Round down
 Exceptions and special cases
- Denormalized numbers
- NaN, Infinity (to be determined)
- Exceptions (underflow, overflow, etc.)
Approach – Floating Point Unit (3)
 FP Arithmetic
- Multiplication, Addition → Fused Multiply-Add
- Division → Software
- Square Root → Software
- Conversion → Software
- Compare
Approach – Integer Unit
 32-bit signed Integer ALU
- Preferably two’s complement (most common representation)
- Single precision maps nicely to 2x32bit registers
 Arithmetic mapping
- Operations: Addition, Multiplication, MAC, Shift
- Mapped onto: Full adder, Wallace Tree, Aligner
 Reconfiguring
- Initially no bypassing (drain pipeline before reconfiguring)
Proposed architecture
 32-bit Input registers
- FP: 32-bit significand & 32-bit exponent
- Integer: 32-bit signed
 3-Stage pipeline
- Stage 1: Aligner for FP or Barrel shifter
32x32 Multiplier
- Stage 2: Full Adder and Leading Zero Det.
- Stage 3: Normalization and Rounding
 2-stage pipeline?
- Merge stage 2 and 3
Testing/Benchmarking
 After functional testing, implementation in 65 or 90nm
 Measure area and power usage
- Benchmark to be determined
Questions
Whatever the question,
lead is the answer.