Digital Signal Processing (DSP)

Kathy Grimes
• Signals
• Electrical
• Mechanical
• Acoustic
• Most real-world signals are Analog – they vary continuously
over time
• Many Limitations with Analog
• Repeatability
• Tolerances
• Difficulty storing information or implementing certain operations
Leads us to DSP…
• Represent signals by sequences of numbers
• Pros
• Repeatable
• Accuracy can be controlled
• Time-varying operations are easier to implement
• Cons
• Sampling causes loss of information
• Round-off errors
• Requires A/D and D/A mixed-signal hardware
• Analog to Digital Converter
• Continuous to Discrete time signal
• Figure 11.1 shows the sampling of a signal
FIGURE 11.1 Discrete Time Signals.
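The continuous-to-discrete step can be sketched in C as follows. This is an idealized model, not a description of any particular converter: the `analog_signal_fn` type, the `example_ramp` signal, and the choice of a signed 16-bit output are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical continuous-time signal: any function of time (seconds)
   that returns an amplitude, nominally in the range [-1.0, 1.0]. */
typedef double (*analog_signal_fn)(double t_seconds);

/* Example signal for illustration only: a ramp x(t) = t. */
double example_ramp(double t_seconds)
{
    return t_seconds;
}

/* Sample `signal` every 1/sample_rate_hz seconds and quantize each
   sample to a signed 16-bit integer, as an idealized A/D converter. */
void sample_and_quantize(analog_signal_fn signal, double sample_rate_hz,
                         int16_t *out, size_t count)
{
    assert(signal != NULL && out != NULL && sample_rate_hz > 0.0);

    for (size_t n = 0; n < count; n++) {
        /* Discrete time: sample n is taken at t = n / sample_rate. */
        double x = signal((double)n / sample_rate_hz);

        /* Clip to the converter's input range. */
        if (x > 1.0)  x = 1.0;
        if (x < -1.0) x = -1.0;

        /* Scale to 16 bits and round to nearest. The rounding here is
           the source of quantization (round-off) error. */
        double scaled = x * 32767.0;
        out[n] = (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    }
}
```

Everything the signal does between two sample instants is discarded, which is the loss of information listed under the cons.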
• Common Signals
• Step Discontinuity (Figure 11.2)
FIGURE 11.2 Step Function.
• Impulse (Figure 11.3)
FIGURE 11.3 Impulse Function.
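As a minimal sketch, the two common signals can be written as functions of the sample index n. The function names and the `generate_sequence` helper are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Unit step u[n]: 0 for n < 0, 1 for n >= 0. */
float unit_step(int n)
{
    return n >= 0 ? 1.0f : 0.0f;
}

/* Unit impulse d[n]: 1 at n == 0, 0 everywhere else. */
float unit_impulse(int n)
{
    return n == 0 ? 1.0f : 0.0f;
}

/* Fill out[0..count-1] with signal(n) for n = first_n, first_n + 1, ... */
void generate_sequence(float (*signal)(int), int first_n,
                       float *out, size_t count)
{
    assert(signal != NULL && out != NULL);
    for (size_t i = 0; i < count; i++)
        out[i] = signal(first_n + (int)i);
}
```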
• DSP systems are built from three basic functions:
• Delay
• Add
• Multiply
FIGURE 11.6 Delay Function.
FIGURE 11.4 Add Function.
FIGURE 11.5 Multiply Function.
• Raw performance of a DSP algorithm is usually measured by the
number of operations needed to execute it
• Feedforward and feedback systems in combination can be used to
implement any discrete difference equation
FIGURE 11.8 Feedback System.
FIGURE 11.7 Feedforward System.
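A small difference equation shows how delay, add, and multiply combine. The sketch below assumes a first-order form, y[n] = b0·x[n] + b1·x[n-1] + a1·y[n-1], which is not taken from the figures; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* State and coefficients for y[n] = b0*x[n] + b1*x[n-1] + a1*y[n-1]. */
typedef struct {
    float b0, b1;  /* feedforward coefficients (applied to inputs)  */
    float a1;      /* feedback coefficient (applied to past output) */
    float x_prev;  /* delay element holding x[n-1] */
    float y_prev;  /* delay element holding y[n-1] */
} first_order_section;

void first_order_process(first_order_section *s, const float *x,
                         float *y, size_t count)
{
    assert(s != NULL && x != NULL && y != NULL);

    for (size_t n = 0; n < count; n++) {
        /* Multiply and add: feedforward terms plus the feedback term. */
        float out = s->b0 * x[n] + s->b1 * s->x_prev + s->a1 * s->y_prev;

        /* Delay: remember the current input and output for the next step. */
        s->x_prev = x[n];
        s->y_prev = out;
        y[n] = out;
    }
}
```

Setting `a1` to zero leaves a pure feedforward system, and setting `b1` to zero leaves a single feedback path, so the same routine covers both cases.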
• Fixed-Point DSPs perform integer operations
• Fixed range – 16 bits = 65,536 distinct values
• Floating-Point DSPs perform integer and floating-point operations
• Dynamic operating range
• Analog world signals = infinite precision
• Floating point mimics the “infinite” range better
• Easier to implement; greatly reduces the scaling and overflow
handling that fixed point requires
• Why not always use Floating-point?
• Cost, Availability, Price, and Performance
• Precision Floating Point is good for smaller values but is poorer at
larger values using same number of bits
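To make the fixed-range point concrete, here is a sketch of 16-bit fixed-point arithmetic in the Q15 convention, where an `int16_t` holds value × 32768. Q15 is an assumption for this example (the slides only say "16 bit"), and the helper names are hypothetical. The explicit saturation and rescaling are the bookkeeping that floating point largely hides.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate result into the 16-bit range. */
int16_t q15_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Convert a real value in [-1.0, 1.0) to Q15, rounding to nearest. */
int16_t q15_from_double(double x)
{
    assert(x >= -1.0 && x < 1.0);
    double scaled = x * 32768.0;
    return q15_saturate((int32_t)(scaled >= 0.0 ? scaled + 0.5
                                                : scaled - 0.5));
}

/* Saturating add: without the clamp, the sum would wrap around. */
int16_t q15_add(int16_t a, int16_t b)
{
    return q15_saturate((int32_t)a + (int32_t)b);
}

/* Multiply: the 32-bit product is in Q30, so add half an LSB for
   rounding and shift back to Q15. The right shift of a negative value
   is assumed to be arithmetic, as on mainstream compilers. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return q15_saturate((product + (1 << 14)) >> 15);
}
```

For instance, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` yields 8192, the Q15 encoding of 0.25, while adding two large values clamps at 32767 instead of overflowing.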
• SIMD Microarchitecture and Instructions
• One instruction operates on 4 data values at once, instead of
1 instruction per value
• Increase of performance for low-level DSP functions (MAC)
FIGURE 11.10 SIMD Instruction.
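A multiply-accumulate (MAC) loop is a natural fit for SIMD. The sketch below uses SSE intrinsics from `<xmmintrin.h>` to compute a dot product four floats at a time. The function name is hypothetical, and the requirement that `count` be a multiple of four is a simplification for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of a and b, processing four floats per iteration. */
float dot_product_sse(const float *a, const float *b, size_t count)
{
    assert(count % 4 == 0);  /* simplification: no scalar tail loop */

    __m128 acc = _mm_setzero_ps();  /* four partial sums, all zero */
    for (size_t i = 0; i < count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        /* Four multiplies and four adds, one instruction each. */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* Reduce the four partial sums to a single scalar. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Because the partial sums are accumulated per lane, the floating-point result can differ slightly from a strictly sequential scalar loop.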
• Processor Clockspeed
• Cache size
• Usually DSP architectures manually partition the memory space
in order to reduce number of accesses to external memory
• Latency = costly in terms of time and resources
• Intel architectures have large amounts of cache and can
overcome the fast/slow memory split; however, all memory starts in
“far” caches
• Output data should be generated sequentially. Accessing
memory in a scattered pattern (while using threads) should be
avoided
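The sequential-access advice can be illustrated with a simple pass over a two-dimensional buffer. The sketch assumes row-major storage and a hypothetical `scale_image` routine; neither comes from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Scale a rows x cols image stored in row-major order.
   The inner loop walks consecutive addresses, so both the reads from
   `in` and the writes to `out` are sequential and cache friendly. */
void scale_image(const float *in, float *out,
                 size_t rows, size_t cols, float gain)
{
    assert(in != NULL && out != NULL);

    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            out[r * cols + c] = gain * in[r * cols + c];
        }
    }
}
```

Swapping the two loops would produce the same result but step through memory with a stride of `cols` elements, which is the kind of scattered pattern the bullet warns against.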
• Intrinsic
• Vectorization
• Intel Performance Primitives
• C code that calls special built-in compiler capabilities that map
closely to underlying SSE instruction set
• Added Data Types
• __m64, __m128, __m128d, __m128i
• Intrinsic Operation Types
• Arithmetic (fixed- and floating-point)
• Shift
• Logical
• Compare
• Set
• Shuffle
• Concatenation
• Example: a packed add adds the four FP values packed into a and b,
performing four additions in one instruction
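A few of these operation types can be seen together in a short sketch using SSE intrinsics. The two wrapper functions are hypothetical; the intrinsics they call (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_cmpgt_ps`, `_mm_and_ps`, `_mm_setzero_ps`, `_mm_storeu_ps`) are standard SSE.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics and the __m128 type */

/* Arithmetic: add four packed floats from a and b in one instruction. */
void add4(const float a[4], const float b[4], float result[4])
{
    assert(a != NULL && b != NULL && result != NULL);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(result, _mm_add_ps(va, vb));
}

/* Set, compare, and logical: replace non-positive lanes with zero. */
void zero_negatives4(const float in[4], float out[4])
{
    assert(in != NULL && out != NULL);
    __m128 v    = _mm_loadu_ps(in);
    __m128 zero = _mm_setzero_ps();        /* set: all lanes 0.0       */
    __m128 mask = _mm_cmpgt_ps(v, zero);   /* compare: all-ones if > 0 */
    _mm_storeu_ps(out, _mm_and_ps(v, mask)); /* logical: keep or clear */
}
```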
• Use the compiler to apply vectorization techniques to loops in the
data processing code: it looks for opportunities to convert
loops from a scalar implementation to a vector-based one (so that
multiple operands can be operated on at the same time)
• Compilers such as GCC target the SIMD instruction set when vectorizing
• Use #pragma directives to guide the compiler and avoid overheads
such as assumed data dependences
Listing 11.4 Explicitly Don’t
Vectorize Loop.
Listing 11.7 Memory Alignment Property and
Discarding Assumed Data Dependences.
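The listings themselves are not reproduced in these slides. As a rough sketch of the idea, the loop below tells the compiler that it may ignore assumed data dependences. The `ivdep` pragma spellings differ between compilers, so they are guarded by preprocessor checks; the function name is hypothetical, and this is not the code from Listing 11.4 or 11.7.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = gain * src[i] + offset for every element.
   `restrict` promises the compiler that dst and src do not overlap,
   and the ivdep pragma asks it to discard assumed loop-carried
   dependences so the loop can be vectorized. */
void scale_and_offset(float *restrict dst, const float *restrict src,
                      size_t count, float gain, float offset)
{
    assert(dst != NULL && src != NULL && dst != src);

#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < count; i++) {
        dst[i] = gain * src[i] + offset;
    }
}
```

Whether the loop is actually vectorized still depends on the compiler and optimization flags, so it is worth checking the compiler's vectorization report rather than assuming the pragma took effect.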
• Comparisons on Performance
• This performance would be vastly different if the memory was
not already aligned
• Intel Libraries – highly optimized implementations for many
different applications (include audio codecs, image processing,
data compression, etc…)
• Libraries take full advantage of CPU and SIMD (and most are
written for performance)
• Libraries are threaded and can obtain performance gains by
parallelizing the algorithm
• Libraries that take advantage are:
• Signal Processing – Convolution and correlation, Finite impulse response
(FIR) filter, FIR coefficients generation function, Infinite impulse response
(IIR) filter, Transforms
• Image Processing
• Small Matrices and Realistic Rendering
• Cryptography
• FIR filter equation
• y[n] = a·x[n] + b·x[n-1] + c·x[n-2]
Listing 11.9 FIR Using Intel Performance Primitives.
Listing 11.8 FIR Filter C Code Example
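The three-tap equation above translates almost directly into C. The following is a plain scalar sketch, not the code from Listing 11.8; the function name is hypothetical, and samples before the start of the buffer are assumed to be zero.

```c
#include <assert.h>
#include <stddef.h>

/* Three-tap FIR filter: y[n] = a*x[n] + b*x[n-1] + c*x[n-2].
   Samples before the start of the buffer are treated as zero. */
void fir3(const float *x, float *y, size_t count,
          float a, float b, float c)
{
    assert(x != NULL && y != NULL && x != y);

    for (size_t n = 0; n < count; n++) {
        float x1 = (n >= 1) ? x[n - 1] : 0.0f;  /* delayed by one sample  */
        float x2 = (n >= 2) ? x[n - 2] : 0.0f;  /* delayed by two samples */
        y[n] = a * x[n] + b * x1 + c * x2;
    }
}
```

Feeding an impulse through the filter returns the coefficients a, b, c in order, which is a quick way to check an implementation.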
• Loop Unrolling to get rid of data
dependences
• By changing the data elements, we
can reduce the number of times we
need to read data
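One way to picture this is a three-tap FIR loop unrolled by two, where each input sample is loaded once and then reused from local variables for the next output. This is a sketch of the general technique under that assumption, not the optimization shown in the original slides; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y[n] = a*x[n] + b*x[n-1] + c*x[n-2], two outputs per iteration.
   Each input sample is read from memory exactly once. */
void fir3_unrolled(const float *x, float *y, size_t count,
                   float a, float b, float c)
{
    assert(x != NULL && y != NULL && x != y);

    float x1 = 0.0f;  /* holds x[n-1]; zero before the first sample */
    float x2 = 0.0f;  /* holds x[n-2]; zero before the first sample */
    size_t n = 0;

    for (; n + 1 < count; n += 2) {
        float x0 = x[n];      /* the only two loads in this iteration */
        float xn = x[n + 1];

        y[n]     = a * x0 + b * x1 + c * x2;
        y[n + 1] = a * xn + b * x0 + c * x1;

        /* Slide the delay line forward by two samples. */
        x2 = x0;
        x1 = xn;
    }

    /* Handle the last sample when count is odd. */
    if (n < count) {
        y[n] = a * x[n] + b * x1 + c * x2;
    }
}
```

The two output statements inside the loop body do not depend on each other, which gives the compiler and the processor more freedom to schedule them.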
• Computation intensive
• Needs a significant amount of embedded computational performance
• Same basic algorithmic pattern even though physical
configurations, parameters, and functionality are different
• Beam forming
• Envelope Extraction
• Polar-to-Cartesian coordinate translation
FIGURE 11.12 Block Diagram of a Typical Ultrasound Imaging Application.
FIGURE 11.15 Block Diagram of the Envelope Detector.
FIGURE 11.16 Polar-to-Cartesian Conversion of a Hypothetically Scanned Rectangular
Object.
Listing 11.11 Code Sample for Envelope Detector.
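The listing is not reproduced here. As a hedged sketch, one common formulation of envelope detection takes the magnitude of the in-phase and quadrature components, sqrt(I² + Q²). Whether Figure 11.15 and Listing 11.11 use exactly this form is an assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* envelope[n] = sqrt(I[n]^2 + Q[n]^2), four samples per iteration. */
void envelope_sse(const float *in_phase, const float *quadrature,
                  float *envelope, size_t count)
{
    assert(count % 4 == 0);  /* simplification: no scalar tail loop */

    for (size_t n = 0; n < count; n += 4) {
        __m128 i = _mm_loadu_ps(in_phase + n);
        __m128 q = _mm_loadu_ps(quadrature + n);

        /* Instantaneous power: I*I + Q*Q for four samples at once. */
        __m128 power = _mm_add_ps(_mm_mul_ps(i, i), _mm_mul_ps(q, q));

        /* Magnitude: four square roots in one instruction. */
        _mm_storeu_ps(envelope + n, _mm_sqrt_ps(power));
    }
}
```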
• Why such a large difference?
• Digital Signal Processing in general-purpose processors
• Extend Processing Capabilities
• Simplifies the overall application when platforms require control,
communications, and general-purpose processing alongside DSP
• Many ways to improve an Intel system by implementing special
C code, vectorization, and specific libraries
• Performance is greatly enhanced when DSP is implemented
properly