Intel Pentium 4
ENCM 515 - 2002
Jonathan Bienert
Tyson Marchuk

Overview:
• Product review
• Specialized architectural features (NetBurst)
• SIMD instructional capabilities (MMX, SSE2)
• SHARC 2106x comparison

Intel Pentium 4
• Reworked micro-architecture for high-bandwidth applications
• Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multi-media, and multi-tasking user environments
• These are DSP intensive applications!
  – What about uses other than in PC?

Hardware Features: (NetBurst micro-architecture)
• Hyper pipelined technology
• Advanced dynamic execution
• Cache (data, L1, L2)
• Rapid ALU execution engines
• 400 MHz bus
• OOE
• Microcode ROM

Hyper Pipeline
• 20-stage pipeline!!!
• breaks down complex CISC instructions
  – sub-stages mimic RISC
  – faster execution

Filling the pipeline...
• Review of next 126 instructions to be executed
• Branch prediction
  – if mispredict must flush 20-stage pipeline!!!
  – branch target buffer (BTB)
  – 4K branch history table (BHT)
  – assembly instruction hints

Cache
• 8KB Data Cache
• L1 Execution Trace Cache
  – 12K of previous micro-instructions stored
  – saves having to translate
• L2 Advanced Transfer Cache
  – 256K for data
  – 256-bit transfer every cycle
    • allows 77GB/s data transfer on 2.4GHz

Rapid ALU Execution Engines
• 2 ALUs
  – allow parallel operations
• Many arithmetic operations take 1/2 cycle
  – each 2X ALU can have 2 operations per cycle

Software Features:
• Multimedia Extensions (MMX)
  – 8 MMX registers
• Streaming SIMD Extensions (SSE2)
  – 8 SSE/SSE2 registers
• Standard x86 Registers
  – EAX, EBX, ECX, EDX, ESI, etc.
  – Register rename to over 100

MMX (Multimedia Extensions)
• Accelerated performance through SIMD
• multimedia, communication, internet applications
• 64-bit packed INTEGER data
  – signed/unsigned

SSE2 (Streaming SIMD Extensions 2)
• Accelerate a broad range of applications
  – video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
• 128-bit SIMD instruction formats
  – 4 single precision FP values
  – 2 double precision FP values
  – 16 byte values
  – 8 word values
  – 4 double word values
  – 2 quad word values
  – 1 128-bit integer value

SIMD Example (16-tap FIR filter - Real numbers)
• Applications for real FIR filters
  – general purpose filters in image processing, audio, and communication algorithms
• Will utilize SSE2 SIMD instruction set

Thinking about SIMD
• SSE2 instruction format is 128-bit
• 128-bit SSE2 registers
• Many data formats!
• What precision do we want?
• Let's use 32-bit floating point for coefficients, input, output
  – 4 data sets x 32-bit = 128 bits

Parallelizing
• Require many single multiplications (coefficients x inputs), then add the results for output!
• Multiplications…
• then need to perform additions...

Using SSE2 format
• Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
• 4 single precision floating point ops per cycle (32-bit)

Additions...
• In both registers, now have 4 32-bit results
  – First add the results into an accumulator register
• 4 single precision floating point ops per cycle (32-bit)

Additions...
• In a register, now have 4 32-bit results
  – however, NO SSE2 instruction to add these 4!
  – But can use other instructions
• Some BIT INTERTWINING…then add
  – This will give results for several output values!

ADI SHARC 21k vs. P4: Disadvantages
• Slower clock speed (40MHz vs 2400MHz)
• Fewer opportunities for parallelism (5 vs 11)
• Much less memory (Cache and System)
  – Limited algorithm applicability
  – Limited applications
• Older (less compiler support)
  – 1994 vs 2001

ADI SHARC 21k vs. P4: Advantages
• Hardware loops
• Easier to program for optimal speed
• Cheaper
• Lower power consumption
• Runs cooler

FIR Performance
• Hard to obtain P4 performance numbers
• Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full.
  – 2 * 2.4GHz ~ 4.8 billion multiplies per second
  – If ~4 multiplies per element & 44000 samples/s
  – FIR length > ~25k taps
• SHARC => ~200 taps (Lab 4)
• Factor of ~125x

IIR Performance
• Hard to obtain P4 performance numbers
• No hardware circular buffers
• Does have BTB, BHT, etc.
• Prefetches ~256 bytes ahead of current position in code.

FFT Performance
• Hard to obtain P4 performance numbers
• Prime95 uses FFTs to perform the Lucas-Lehmer test for Mersenne primes
  – Involves FFT, squaring and iFFT, etc.
• 256k points on P4 2.3GHz ~ 10.517ms
• Compare to SHARC 2048 point FFT ~0.37ms
• If SHARC could do 256k, 46.25ms (But…)

Optimization Example
• Hard to optimize Pentium 4 assembly
• Example of multiplying by a constant, 10
• Taken mainly from: www.emulators.com/docs/pentium_1.htm

Multiplying by 10
• Slowest way:
  – IMUL EAX, 10
• Usually optimal way (Visual C++ 6.0)
  – LEA EAX, [EAX+EAX*4]
  – SHL EAX, 1
  – Shift, add, shift
  – On most x86 processors takes 2 cycles
  – On Pentium MMX and before takes 3 cycles
  – On Pentium 4 takes 6 cycles!

Multiplying by 10
• Optimal for Pentium 4
  – LEA ECX, [EAX+EAX]
  – LEA EAX, [ECX+EAX*8]
  – On most x86 still takes 2 cycles
  – On Pentium 4 takes ~3 cycles (OOE - Ops)
  – But on older processors (Pentium MMX and before) this now takes 4 cycles!

Multiplying by 10
• Best generic case
  – LEA EAX, [EAX+EAX*4]
  – ADD EAX, EAX
  – On most x86 still takes 2 cycles
  – On older processors (Pentium MMX and before) this takes 3 cycles again
  – On Pentium 4 this takes 4 cycles
• Obviously really hard to optimize

REFERENCES
• Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions
• graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html
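
Supplementary sketch for the "Multiplying by 10" slides: the four instruction sequences listed there differ only in timing, not in the value left in EAX. The Python model below is a minimal illustration under stated assumptions, not part of the original presentation. It treats EAX and ECX as 32-bit unsigned values with wraparound and ignores flags and cycle counts. The helper names (`lea`, `mul10_imul`, `mul10_lea_shl`, `mul10_two_lea`, `mul10_lea_add`) are hypothetical and chosen only for this example.

```python
MASK32 = 0xFFFFFFFF  # assumption: model a 32-bit register such as EAX


def lea(base, index=0, scale=1):
    """Model LEA dst, [base + index*scale] with 32-bit wraparound."""
    return (base + index * scale) & MASK32


def mul10_imul(eax):
    # IMUL EAX, 10  (low 32 bits of the product)
    return (eax * 10) & MASK32


def mul10_lea_shl(eax):
    # LEA EAX, [EAX+EAX*4]
    eax = lea(eax, eax, 4)        # EAX = 5*x
    # SHL EAX, 1
    return (eax << 1) & MASK32    # EAX = 10*x


def mul10_two_lea(eax):
    # LEA ECX, [EAX+EAX]
    ecx = lea(eax, eax, 1)        # ECX = 2*x
    # LEA EAX, [ECX+EAX*8]
    return lea(ecx, eax, 8)       # EAX = 2*x + 8*x


def mul10_lea_add(eax):
    # LEA EAX, [EAX+EAX*4]
    eax = lea(eax, eax, 4)        # EAX = 5*x
    # ADD EAX, EAX
    return (eax + eax) & MASK32   # EAX = 10*x


if __name__ == "__main__":
    variants = (mul10_imul, mul10_lea_shl, mul10_two_lea, mul10_lea_add)
    for x in (0, 1, 7, 123456, 0x7FFFFFFF, 0xFFFFFFFF):
        results = {f(x) for f in variants}
        # All four sequences should leave the same 32-bit value in EAX.
        assert len(results) == 1
        print(f"x={x:#010x} -> 10*x mod 2^32 = {results.pop():#010x}")
```

The model only checks that the sequences are arithmetically interchangeable. Which one is fastest depends on the processor, as the cycle counts on the slides show.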