Kathy Grimes

• Signals
  • Electrical, Mechanical, Acoustic
  • Most real-world signals are analog – they vary continuously over time
• Many limitations with analog
  • Repeatability
  • Tolerances
  • Difficulty storing information or implementing certain operations
• Leads us to DSP…
  • Represent signals by sequences of numbers
  • Pros
    • Repeatable
    • Accuracy can be controlled
    • Time-varying operations are easier to implement
  • Cons
    • Sampling causes loss of information
    • Round-off errors
    • Requires A/D and D/A mixed-signal hardware
• Analog-to-Digital Converter
  • Converts a continuous-time signal to a discrete-time signal
  • Figure 11.1 shows the sampling of a signal

FIGURE 11.1 Discrete Time Signals.

• Common signals
  • Step discontinuity (Figure 11.2)
  • Impulse (Figure 11.3)

FIGURE 11.2 Step Function.
FIGURE 11.3 Impulse Function.

• Discrete-time systems are based on three basic functions:
  • Delay
  • Add
  • Multiply

FIGURE 11.4 Add Function.
FIGURE 11.5 Multiply Function.
FIGURE 11.6 Delay Function.

• Raw performance of a DSP algorithm is usually measured by the number of operations needed to execute it
• Feedforward and feedback systems in combination can be used to develop any discrete difference equation

FIGURE 11.7 Feedforward System.
FIGURE 11.8 Feedback System.

• Fixed-point DSPs perform integer operations
  • Fixed range – 16 bits = 65,536 representable values
• Floating-point DSPs perform integer and floating-point operations
  • Dynamic operating range
  • Analog-world signals have effectively infinite precision; floating point mimics that "infinite" range better
  • Easier to implement with; avoids many rounding and overflow errors
• Why not always use floating point?
  • Cost, availability, and performance
  • Floating-point precision is good for smaller values but poorer at larger values using the same number of bits
• SIMD microarchitecture and instructions
  • One clock cycle: one instruction operates on four data values
  • Increases performance of low-level DSP functions (e.g., MAC)

FIGURE 11.10 SIMD Instruction.
• Processor clock speed
• Cache size
• DSP architectures usually partition the memory space manually in order to reduce the number of accesses to external memory
  • Latency is costly in terms of time and resources
• Intel architectures have large amounts of cache and can overcome the fast/slow memory split; however, all memory starts in "far" caches
• Output data should be generated sequentially; accessing memory in a scattered pattern (while using threads) should be avoided
• Options for exploiting SIMD: intrinsics, vectorization, Intel Performance Primitives
• Intrinsics
  • C code that calls special built-in compiler capabilities that map closely to the underlying SSE instruction set
  • Added data types: __m64, __m128, __m128d, __m128i
  • Intrinsic operation types:
    • Arithmetic (fixed- and floating-point)
    • Shift
    • Logical
    • Compare
    • Shuffle
    • Concatenation
  • Example: an add intrinsic adds the four FP values packed into a and b – four additions in one instruction
• Vectorization
  • Use the compiler to apply vectorization techniques to loops within the data-processing iteration; it looks for opportunities to convert loops from a single-operand to a vector-based implementation (so that multiple operands can be operated on at the same time)
  • Like GCC, aligned with the SIMD instruction set
  • Use #pragma directives to guide the compiler and avoid overheads such as assumed data dependences

Listing 11.4 Explicitly Don't Vectorize Loop.
Listing 11.7 Memory Alignment Property and Discarding Assumed Data Dependences.
• Comparisons on performance
  • This performance would be vastly different if the memory were not already aligned
• Intel libraries – highly optimized implementations for many different applications (audio codecs, image processing, data compression, etc.)
  • The libraries take full advantage of the CPU and SIMD (and most are written for performance)
  • The libraries are threaded and can obtain performance gains by parallelizing the algorithm
  • Areas covered:
    • Signal processing – convolution and correlation, finite impulse response (FIR) filters, FIR coefficient generation functions, infinite impulse response (IIR) filters, transforms
    • Image processing
    • Small matrices and realistic rendering
    • Cryptography
• FIR filter equation
  • y[n] = a·x[n] + b·x[n−1] + c·x[n−2]

Listing 11.8 FIR Filter C Code Example.
Listing 11.9 FIR Using Intel Performance Primitives.

• Loop unrolling to get rid of data dependences
  • By reordering the data elements, we can reduce the number of times we need to read data
• Ultrasound imaging application (Figure 11.12)
  • Computation intensive – needs a significant amount of embedded computational performance
  • Same basic algorithmic pattern even though physical configurations, parameters, and functionality differ:
    • Beam forming
    • Envelope extraction
    • Polar-to-Cartesian coordinate translation

FIGURE 11.12 Block Diagram of a Typical Ultrasound Imaging Application.
FIGURE 11.15 Block Diagram of the Envelope Detector.
FIGURE 11.16 Polar-to-Cartesian Conversion of a Hypothetically Scanned Rectangular Object.
Listing 11.11 Code Sample for Envelope Detector.

• Why such a large difference?
• Digital signal processing in general-purpose processors
  • Extends processing capabilities
  • Simplifies the overall application when platforms require control, communications, and general-purpose processing along with DSP
  • Many ways to improve an Intel system: implementing special C code, vectorization, and specific libraries
  • Performance is greatly enhanced when DSP is implemented properly