Kathy Grimes

• Signals
  • Electrical, Mechanical, Acoustic
  • Most real-world signals are analog – they vary continuously over time
• Many limitations with analog
  • Repeatability
  • Tolerances
  • Difficulty storing information or implementing certain operations
• Leads us to DSP…
  • Represent signals by sequences of numbers
  • Pros
    • Repeatable
    • Accuracy can be controlled
    • Time-varying operations are easier to implement
  • Cons
    • Sampling causes loss of information
    • Round-off errors
    • Requires A/D and D/A mixed-signal hardware
• Analog-to-Digital Converter
  • Converts a continuous-time signal to a discrete-time signal
  • Figure 11.1 shows the sampling of a signal

FIGURE 11.1 Discrete Time Signals.

• Common signals
  • Step discontinuity (Figure 11.2)
  • Impulse (Figure 11.3)

FIGURE 11.2 Step Function.
FIGURE 11.3 Impulse Function.

• Discrete-time systems are based on three basic functions:
  • Delay
  • Add
  • Multiply

FIGURE 11.4 Add Function.
FIGURE 11.5 Multiply Function.
FIGURE 11.6 Delay Function.

• Raw performance of a DSP algorithm is usually measured by the number of operations needed to execute it
• Feedforward and feedback systems in combination can be used to develop any discrete difference equation

FIGURE 11.7 Feedforward System.
FIGURE 11.8 Feedback System.

• Fixed-point DSPs perform integer operations
  • Fixed range – 16 bits = 65,536 representable values
• Floating-point DSPs perform integer and floating-point operations
  • Dynamic operating range
  • Analog-world signals have effectively infinite precision; floating point mimics that "infinite" range better
  • Easier to implement with; avoids many rounding and overflow errors
• Why not always use floating point?
  • Cost, availability, and performance
  • Floating-point precision is good for smaller values but poorer at larger values using the same number of bits
• SIMD microarchitecture and instructions
  • One clock cycle: one instruction operates on four data values
  • Increases performance of low-level DSP functions (e.g., MAC)

FIGURE 11.10 SIMD Instruction.
• Processor clock speed
• Cache size
• DSP architectures usually partition the memory space manually in order to reduce the number of accesses to external memory
  • Latency is costly in terms of time and resources
• Intel architectures have large amounts of cache and can overcome the fast/slow memory split; however, all memory starts in "far" caches
• Output data should be generated sequentially; accessing memory in a scattered pattern (while using threads) should be avoided
• Options for exploiting SIMD: intrinsics, vectorization, Intel Performance Primitives
• Intrinsics
  • C code that calls special built-in compiler capabilities that map closely to the underlying SSE instruction set
  • Added data types: __m64, __m128, __m128d, __m128i
  • Intrinsic operation types:
    • Arithmetic (fixed- and floating-point)
    • Shift
    • Logical
    • Compare
    • Shuffle
    • Concatenation
  • Example: an add intrinsic adds the four FP values packed into a and b – four additions in one instruction
• Vectorization
  • Use the compiler to apply vectorization techniques to loops within the data-processing iteration; it looks for opportunities to convert loops from a single-operand to a vector-based implementation (so that multiple operands can be operated on at the same time)
  • Like GCC, aligned with the SIMD instruction set
  • Use #pragma directives to guide the compiler and avoid overheads such as assumed data dependences

Listing 11.4 Explicitly Don't Vectorize Loop.
Listing 11.7 Memory Alignment Property and Discarding Assumed Data Dependences.
• Comparisons on performance
  • This performance would be vastly different if the memory were not already aligned
• Intel libraries – highly optimized implementations for many different applications (audio codecs, image processing, data compression, etc.)
  • The libraries take full advantage of the CPU and SIMD (and most are written for performance)
  • The libraries are threaded and can obtain performance gains by parallelizing the algorithm
  • Areas covered:
    • Signal processing – convolution and correlation, finite impulse response (FIR) filters, FIR coefficient generation functions, infinite impulse response (IIR) filters, transforms
    • Image processing
    • Small matrices and realistic rendering
    • Cryptography
• FIR filter equation
  • y[n] = a·x[n] + b·x[n−1] + c·x[n−2]

Listing 11.8 FIR Filter C Code Example.
Listing 11.9 FIR Using Intel Performance Primitives.

• Loop unrolling to get rid of data dependences
  • By reordering the data elements, we can reduce the number of times we need to read data
• Ultrasound imaging application (Figure 11.12)
  • Computation intensive – needs a significant amount of embedded computational performance
  • Same basic algorithmic pattern even though physical configurations, parameters, and functionality differ:
    • Beam forming
    • Envelope extraction
    • Polar-to-Cartesian coordinate translation

FIGURE 11.12 Block Diagram of a Typical Ultrasound Imaging Application.
FIGURE 11.15 Block Diagram of the Envelope Detector.
FIGURE 11.16 Polar-to-Cartesian Conversion of a Hypothetically Scanned Rectangular Object.
Listing 11.11 Code Sample for Envelope Detector.

• Why such a large difference?
• Digital signal processing in general-purpose processors
  • Extends processing capabilities
  • Simplifies the overall application when platforms require control, communications, and general-purpose processing along with DSP
  • Many ways to improve an Intel system: implementing special C code, vectorization, and specific libraries
  • Performance is greatly enhanced when DSP is implemented properly