Veton Këpuska
Florida Institute of Technology
Digital Signal Processing: From Theory to Practical Audio and Video Applications
Table of Contents

2. Digital Signal Representations
   2.1 Introduction
   2.2 Numbers and Numerals
       2.2.1 Number Systems
             The Babylonian Systems
             The Egyptian System
             Maya Indians
             The Greek System - Abacus
             Roman System
             Hindu-Arabic Numerals
       2.2.2 Types of Numbers
             Whole Numbers
             Integer Numbers
             Fractions or Rational Numbers
             Irrational Numbers
             Real Numbers & Complex Numbers
       2.2.3 Positional Number Systems
   2.3 Sampling and Reconstruction of Signals
   2.4 Scalar Quantization
       2.4.1 Quantization Noise
             Granular Distortion
             Overload Distortion
             Analysis of Quantization Noise
       2.4.2 Signal-to-Noise Ratio
       2.4.3 Transmission Rate
       2.4.4 Nonuniform Quantizer
       2.4.5 Companding
   2.5 Data Representations
   2.6 Fixed-Point Number Representations
       2.6.1 Sign-Magnitude Format
       2.6.2 One's-Complement Format
       2.6.3 Two's-Complement Format
   2.7 Fixed-Point DSP's
   2.8 Fixed-Point Representations Based on Radix-Point
       2.8.1 Dynamic Range
       2.8.2 Precision
3. Implementation Considerations
   3.1 Assembly
   3.2 C – Language Support for Fractional Data Types
   3.3 C++ – Language Support for Fractional Data Types
   3.4 C vs. C++ Important Distinctions
Chapter 2

Digital Signal Representations
To bridge the gap from theory to practice one has to master the conventions used to represent the data in a DSP processor.

The details of the digital representations of discrete-time signals are presented in this chapter, bridging the gap between the abstract discrete-time signal notation presented earlier in the book, x[n], and its representation in a digital processor, and more specifically in a digital signal processor (DSP).

2.1 Introduction
Continuous signals (referred to in the literature also as analog signals; both terms are used here interchangeably unless stated otherwise) are necessarily sampled at discrete time intervals, as well as approximated by a finite number of discrete magnitude values, in order to be represented digitally.

Because digital processing devices operate at discrete time steps, continuous signals must be sampled at discrete time intervals. It turns out that continuous signals can be sampled at discrete time intervals, producing discrete-time signals, without any loss or degradation compared to the original signal: a continuous signal reconstructed from its discrete-time representation is identical to the original provided certain conditions are met. Those conditions are stated by the Sampling Theorem presented in Chapter 1 [1][2].

An additional limitation of digital processing devices, the degree of which is dictated by their architecture, is the restriction that data must be represented by a finite number of digits, or more specifically by a finite number of bits. Typically, digital processors are designed to store and process data with a fixed minimal and maximal number of bits allocated for each representation. These restrictions force all representations to have finite precision.
The process of representing a continuous actual value by a discrete representation is known as Quantization. When finite precision is used to represent actual values, the following considerations are necessary in order to assess the quantization effects on the output [1][3][4]:

1. Quantize in time and magnitude the continuous input signal x(t) to obtain the discrete-time sequence x[n];
2. Quantize the actual values of the coefficients {Ak, k=0,…,N} representing a DSP system (e.g., a filter) with a finite-precision representation {ak, k=0,…,N}; and
3. Consider the effects of arithmetic operations using finite-precision representations on the output, and modify the implementation as necessary to obtain an optimal result.
The effects of quantization of the continuous signal and of finite-precision operations are well studied and understood [1][2][3][4]. Consequently, it is possible to convert continuous signals to digital form, process them, and reconstruct them back to a continuous representation with the desired quality. Reconstructed signals typically fulfill quality criteria that are superior to those of their analog counterparts.

In the following sections all three enumerated issues regarding the representation of data with finite precision are discussed. However, it is also important to understand the development of the abstract concept of numbers and the historical roots of their representation. The discussion of numbers and number systems is therefore introduced from a historical perspective, which, it is believed, sheds light on the fundamental concepts that shaped the current understanding of numbers and how they are represented.
2.2 Numbers and Numerals
The development of human civilization is closely followed by the development of representations of numbers [5]. Numbers are represented by numerals (Webster's Dictionary defines a numeral as "a conventional symbol that represents a number"). In the past there were several kinds of numeral notations and symbols.

In the early days one pile of items was considered equivalent to another pile of a different number of items of a different kind. This value system was used for trading goods. Further development came with the standardization of "value": a fixed number of items of one kind (e.g., 5) placed in a special corresponding place was considered equivalent to one item of a special kind placed in another place. This correspondence led to the earliest ways of representing numbers in written form. Since those early days, the way we do arithmetic has been intimately related to the way we represent numbers [5][6].
2.2.1 Number Systems

Early number systems, named after the cultures/civilizations that used them [5], are listed below:

• Babylonian
• Egyptian
• Maya
• Greek
• Roman
• Hindu-Arabic
The Babylonian Systems
The earliest recorded numerals are on Sumerian clay tablets dating from the first half of the third millennium B.C. The Sumerian system was later taken over by the Babylonians. The everyday system for relatively small numbers was based on grouping by tens, hundreds, etc., inherited from Mesopotamian civilizations. Large numbers were seldom used. More difficult mathematical problems were handled using sexagesimal (radix-60) positional notation. Sexagesimal notation was highly developed as early as 1750 B.C. This notation was unique in that it was actually a floating-point form of representation with the exponents omitted; the proper scale factor, or power of sixty, was to be supplied by the context. The Babylonian cuneiform script (from the Latin cuneus, wedge) was formed by impressing wedge-shaped marks in clay tablets.

It is because the ancients made astronomical calculations in base 60 that we still use this system for measuring time: one hour comprises 60 minutes, and one minute 60 seconds. A circle comprises 360 degrees (°) because the earth circles the sun in roughly 360 days. Due to the Babylonians, each degree is divided into 60 minutes (′), each minute into 60 seconds (″), and each second into 60 thirds (‴) [5]. Babylonian notation was positional (i.e., place-value notation): the same symbol may mean 1, 60, 60², etc., according to its position. Since they had no concept of zero, this notation could be confusing because of its ambiguity.
The Egyptian System
The Egyptian system used | for 1, ||||| for 5, ∩ for 10, ∩∩∩∩∩ for 50, etc. Because a different symbol was used for ones, tens, hundreds, thousands, etc., the range of numbers that could be represented was limited. Note that the Romans later adopted this system to represent their numbers.
Maya Indians
Of the ancient civilizations, only the Maya Indians used the concept of "zero" as a quantity, sometime around 200 A.D. They also introduced fixed-point notation as early as the 1st century A.D. Their number system was a radix-20 system.
The Greek System - Abacus
Greek numerals from about the 5th century B.C. used alphabetic characters (24 characters) to represent numbers. Since 27 symbols were needed, three letters of Semitic origin were adopted. The Greek abacus originates at about the 2nd century B.C.: rows and columns of pebbles organized in a matrix correspond to our decimal system. The written form, however, did not follow the positional notation of the decimal system. On the other hand, Greek astronomers made use of a sexagesimal positional notation for fractions, adapted from the Babylonians.
Roman System
Because Roman numerals were in use in Europe for over a thousand years, we are still familiar with them and use them in certain instances (clock faces, enumerated lists in written documents, monuments, etc.). The Roman number system was based on the Etruscan letter notations I, V, X, L, C, D, and M for 1, 5, 10, 50, 100, 500, and 1000. The subtractive principle, whereby 9 and 40 are written as IX and XL, became popular during medieval times; it was hardly used by the Romans themselves. It is interesting to note that the original symbol for M (1000) was CIƆ; the symbol ∞ is a corruption of it. In 1655 John Wallis proposed that this symbol be used for "infinity" [5].
Hindu-Arabic Numerals
The numeration we use now: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, is often referred to as Arabic notation, but it is of Hindu origin; it was transmitted to Europe by Arab scholars. The value of a digit depends on its position in the number (its place in the number determines its value). Consequently, zero is needed to be able to represent numbers unambiguously; compare, for example, 704 with 74. In fact, it was in this way that the concept of zero forced itself onto Indian mathematicians. In theory, zero is also needed occasionally in the Babylonian system, but as the base is much larger, the context would usually supply the missing information. Consequently, the Babylonians struggled on without zero for over a thousand years.
Such early notations were inconvenient for performing arithmetic operations except in the simplest cases. Analysis of those early number systems also reveals two distinct approaches: sign-value notation (e.g., the Roman numeral system) and positional notation, or place-value notation, which is commonly used today. Furthermore, the abstract concept of a number and the objects being counted were not separable for a long time, as exemplified by many languages. In those languages there are many names for a number of particular objects but not for the idea of number itself. For example, Fiji Islanders use "bolo" for ten boats but "koro" for ten coconuts. In English, a couple refers to two people, a century to 100 years, etc.
2.2.2 Types of Numbers
In order to understand how numbers are represented in modern digital computing systems, it is important to know what kinds of numbers are in use.
Whole Numbers
Whole numbers are 1, 2, 3, 4, …, defining the set ℕ, also called the counting or natural numbers. Zero is sometimes included among the whole numbers, but there seems to be no general agreement. Some authors also interpret "whole number" to mean "a number having a fractional part of zero," making the whole numbers equivalent to the integers.
Integer Numbers
The advancement of mathematics brought by the discipline of algebra forced the recognition of negative numbers (e.g., obtaining the solution of the equation 2x+9=3 requires the introduction of negative numbers). The set of whole numbers, when extended with zero and the negative whole numbers, defines the set ℤ of integers: …, -4, -3, -2, -1, 0, 1, 2, 3, 4, …
Fractions or Rational Numbers
A fraction, or a rational number, is defined as the ratio of two whole numbers p, q:

$$\frac{p}{q}$$

The set of all rational numbers is denoted by ℚ, derived from the German word Quotient, which can be translated as ratio.

Most of the early systems used and named only a few obvious common fractions. In the famous Rhind papyrus (found in the memorial, or mortuary, temple of Pharaoh Ramesses II), a famous document from the Egyptian Middle Kingdom that dates to 1650 B.C., only simple names for the unit fractions 1/2, 1/3, 1/4, 1/5, …, and for 2/3, were used. Other fractions, when required, were obtained by adding these simple fractions. For example:

$$\frac{5}{7} = \frac{1}{2} + \frac{1}{7} + \frac{1}{14}$$
Irrational Numbers
The discovery of irrational numbers is attributed to Pythagoras, who found that the diagonal of a square is not a rational multiple of its side (the diagonal of a square with sides equal to 1 is √2). In other words, the ratio of the diagonal to the side cannot be expressed by whole numbers. Irrational numbers have decimal expansions that neither terminate nor become periodic. Examples of irrational numbers are √2, √3, π, and e.
Real Numbers & Complex Numbers
The collection of the rational and irrational numbers defines the set ℝ of real numbers. Real numbers can be extended to complex numbers with the addition of the imaginary unit i = √(-1). A complex number z is expressed as:

$$z = x + iy$$

where x, y are real numbers and i is the imaginary unit.
2.2.3 Positional Number Systems
In positional notation, the value of a number depends on the numeral as well as on its position within the number. Typically, the value of a position is a power of ten. For example, the number represented by the numeral 1957 is equal to 7 ones, 5 tens, 9 hundreds, and 1 thousand. This concept leads to the following generalization of the value represented by a numeral:

Explicit Position of Radix Point

$$x = \pm\left(d_{n-1}B^{n-1} + \cdots + d_2B^2 + d_1B^1 + d_0B^0 \,.\, d_{-1}B^{-1} + d_{-2}B^{-2} + \cdots + d_{-m}B^{-m}\right)$$

where ± is the sign of the number, the d_i ∈ {0, 1, 2, …, B-1} are the numerals, "." is the decimal, or in general radix, point, and B is the base of the number system.

Note that the portion to the left of the radix point, called the integral part, denotes the integer part of the number, represented by n numerals. The portion to the right of the radix point, called the fractional part, represents a fractional number less than 1, represented by m numerals. With this notation, the set of real numbers ℝ can be represented [7].

Computers can only use a finite subset of the numbers, due to the finite resources available to represent a number. Consequently, only a finite and limited set of numbers can be represented. This set is defined by the total number of elements that it can represent as well as by the range of values that it covers. The native representation of a numeral in a computer is in the binary system, base B = 2. The numerical value, in our accustomed reference base-10 number system, of a base-2 (binary) number is given by the following expression:

$$x = \pm\left(b_{n-1}2^{n-1} + \cdots + b_2 2^2 + b_1 2^1 + b_0 2^0 \,.\, b_{-1}2^{-1} + b_{-2}2^{-2} + \cdots + b_{-m}2^{-m}\right)$$

where ± is the sign of the number, b_i ∈ {0, 1} takes values from the set of binary numerals, and "." is the binary point. The range of values and their precision are defined by n, the number of bits used to represent the integer portion of the number, and m, the number of bits used to represent the fractional part.
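To make the positional formula concrete, the following C sketch evaluates the base-10 value of an unsigned binary word with n integer bits and m fractional bits. The function name and the example bit pattern are illustrative, not part of any particular DSP library:

    #include <stdio.h>

    /* Value of an unsigned binary word with n integer bits and m fractional
     * bits: x = sum over i of b_i * 2^(i - m), for i = 0 .. n+m-1. */
    double binary_value(unsigned bits, int n, int m)
    {
        double x = 0.0;
        for (int i = 0; i < n + m; i++)
            if (bits & (1u << i))
                x += (double)(1u << i) / (double)(1u << m);
        return x;
    }

    int main(void)
    {
        /* 1011.01 binary: n = 4 integer bits, m = 2 fractional bits */
        printf("%f\n", binary_value(0x2D, 4, 2)); /* 0x2D = 101101; prints 11.250000 */
        return 0;
    }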
2.3 Sampling and Reconstruction of Signals
A typical DSP system interfaces with the continuous world via Analog-to-Digital (ADC) and Digital-to-Analog (DAC) converters, as depicted in Figure 2.1.
[Figure: an analog front end (Sensor, Analog Signal Conditioning) feeding a digital chain (ADC, DSP, DAC, Digital Signal Conditioning).]

Figure 2.1 DSP interfacing with a continuous signal that is optionally conditioned by the sensor conditioner.
In order to satisfy the Sampling Theorem requirements, the continuous input signal must be guaranteed to be band-limited; thus, the ADC is preceded by a low-pass filter [1]. This pre-filtering is a critical step in any digital processing system: it ensures that the effects of aliasing are minimized to levels that are not perceptible by the intended audience. The filter is implemented as an analog low-pass filter. The band-limited signal is then sampled at a fixed sample rate, or equivalently sampling frequency, fs. The sampling is performed by a sample-and-hold device. The signal is then quantized and represented in digital form as a sequence of binary digits (bits) taking the values 1 and 0. The quantized representation of the data is then converted to the desired digital representation of the DSP to facilitate further processing. The conversion process is depicted in Figure 2.2.
[Figure: x(t) → Analog Low-pass Filter → Sample and Hold → Analog-to-Digital Converter → DSP, with panels a), b), c).]

Figure 2.2 Analog-to-Digital conversion. a) Continuous signal x(t). b) Sampled signal xa(nT), with the sampling period T satisfying the Nyquist rate as specified by the Sampling Theorem. c) Digital sequence x[n] obtained after sampling and quantization.
Example 2.1
Assume that the input continuous-time signal is a pure periodic signal represented by the following expression:

$$x(t) = A\sin(\Omega_0 t + \phi) = A\sin(2\pi f_0 t + \phi)$$

where A is the amplitude of the signal, Ω₀ is the frequency in radians per second (rad/sec), φ is the phase in radians, and f₀ is the frequency in cycles per second, measured in Hertz (Hz).

Assuming that the continuous-time signal x(t) is sampled every T seconds, or alternatively with the sampling rate fs = 1/T, the discrete-time signal x[n] obtained by setting t = nT is:

$$x[n] = A\sin(\Omega_0 nT + \phi) = A\sin(2\pi f_0 nT + \phi)$$
An alternative representation of x[n]:

$$x[n] = A\sin\!\left(2\pi \frac{f_0}{f_s} n + \phi\right) = A\sin(2\pi F_0 n + \phi) = A\sin(\omega_0 n + \phi)$$

reveals additional properties of the discrete-time signal. F₀ = f₀/fs defines the normalized frequency, and ω₀ the digital frequency, defined as:

$$\omega_0 = 2\pi F_0 = \Omega_0 T, \quad 0 \le \omega_0 \le 2\pi$$
A DSP processor performs a programmed operation, typically a complex algorithm, on the suitably represented input signal. The result is obtained as a sequence of digital values. Those values, after being converted into an appropriate data representation (e.g., 24-bit signed integers), are converted back into the continuous domain via a digital-to-analog converter (DAC). The procedure is depicted in Figure 2.3.
[Figure: DSP → y[n] → Digital-to-Analog Converter → ya(nT) → Analog Low-pass Filter → y(t), with panels a), b), c).]

Figure 2.3 Digital-to-Analog conversion. a) Processed digital signal y[n]. b) Continuous signal representation ya(nT). c) Low-pass-filtered continuous signal y(t).
Quantization in time (via sampling) and in amplitude of the continuous input signal x(t) to obtain the discrete-time signal x[n], as well as quantization of the coefficients of digital signal processing structures, also requires resolving how numbers are represented by a digital signal processor. The next section discusses the issues of quantization, numbers, and their representations.
2.4 Scalar Quantization
The component of the system that transforms an input value x[n] into one of a finite set of prescribed values x̂[n] is called a scalar quantizer. As depicted in Figure 2.2, this function is performed by the ideal sample-and-hold followed by the Analog-to-Digital Converter. The function can be further refined by the representation depicted in Figure 2.4: the ideal C/D converter represents the sampling performed by the sample-and-hold, while the quantizer and coder combined represent the ADC.
[Figure: x(t) → C/D → Quantizer → Coder.]

Figure 2.4 Conceptual representation of the ADC.
This conceptual abstraction allows us to assume that the sequence x[n] is obtained with infinite precision. The values of x[n] are scalar-quantized to a set of finite-precision amplitudes, denoted here by x̂Q[n]. Furthermore, quantization allows this finite-precision set of amplitudes to be represented by a corresponding set of (bit) patterns, or symbols, x̂[n]. Without loss of generality it can be assumed that the input signals cover a finite range of values defined by the minimal and maximal values xmin and xmax, respectively. This assumption in turn implies that the set of symbols representing x̂[n] is finite. The process of mapping a finite set of values to a finite set of symbols is known as encoding; it is performed by the coder, as in Figure 2.4. Thus one can view quantization and coding as a mapping of the infinite-precision value x[n] to a finite-precision representation x̂[n] picked from a finite set of symbols.
Quantization, therefore, is a mapping of a value x[n], with xmin ≤ x[n] ≤ xmax, to x̂[n]. The quantizer operator, denoted by Q(x), is defined by:

$$\hat{x}[n] = \hat{x}_i = Q(x[n]), \quad x_{i-1} < x[n] \le x_i$$

where x̂_i denotes one of L possible quantization levels, 1 ≤ i ≤ L, and x_i represents one of the L+1 decision levels.
The above expression is interpreted as follows: if x_{i-1} < x[n] ≤ x_i, then x[n] is quantized to the quantization level x̂_i, and x̂[n] is considered the quantized sample of x[n]. From the limited range of input values and the finite number of symbols it follows that the quantizer is characterized by its quantization step size Δ_i, defined as the difference between two consecutive decision levels:

$$\Delta_i = x_i - x_{i-1}$$
Example 2.2

Assume there are L = 2⁴ = 16 reconstruction levels, that the input values x[n] fall within the range [xmin = -1, xmax = 1], and that each value in this range is equally likely. The decision and reconstruction levels are then equally spaced, with Δ = Δ_i = (xmax - xmin)/L = 1/8:

Decision Levels:

$$\left\{-\tfrac{15\Delta}{2}, -\tfrac{13\Delta}{2}, -\tfrac{11\Delta}{2}, -\tfrac{9\Delta}{2}, -\tfrac{7\Delta}{2}, -\tfrac{5\Delta}{2}, -\tfrac{3\Delta}{2}, -\tfrac{\Delta}{2}, \tfrac{\Delta}{2}, \tfrac{3\Delta}{2}, \tfrac{5\Delta}{2}, \tfrac{7\Delta}{2}, \tfrac{9\Delta}{2}, \tfrac{11\Delta}{2}, \tfrac{13\Delta}{2}, \tfrac{15\Delta}{2}\right\}$$

Reconstruction Levels:

$$\left\{-8\Delta, -7\Delta, -6\Delta, -5\Delta, -4\Delta, -3\Delta, -2\Delta, -\Delta, \Delta, 2\Delta, 3\Delta, 4\Delta, 5\Delta, 6\Delta, 7\Delta, 8\Delta\right\}$$

with x̂ = Q(x).
Figure 2.5 Example of uniform quantization with L=16 levels. As discussed in the following sections, L=16 levels require a 4-bit codeword to represent each level. Because the distribution of the input values is uniform, the decision and reconstruction levels are uniformly spaced.
In Example 2.2 a uniform quantizer was described. Formally, a uniform quantizer is defined as one whose decision and reconstruction levels are uniformly spaced. Specifically:

$$\Delta = \Delta_i = x_i - x_{i-1}, \quad 1 \le i \le L$$

$$\hat{x}_i = \frac{x_i + x_{i-1}}{2}, \quad 1 \le i \le L$$

Thus Δ, the step size, is equal to the spacing between any two consecutive decision levels and is likewise the constant spacing between any two consecutive reconstruction levels of a uniform quantizer.

Each reconstruction level is assigned a symbol, or codeword. Binary numbers are typically used to represent the quantized samples. The term codebook refers to the collection of all codewords or symbols. In general, with a B-bit binary codebook there are 2^B different quantization (or reconstruction) levels. This representational issue is detailed in the following sections.
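The following C sketch implements a uniform quantizer of the kind just described, returning the midpoint of the decision cell the input falls into. The function and parameter names are illustrative:

    #include <math.h>
    #include <stdio.h>

    /* Uniform quantizer over [xmin, xmax] with L = 2^B levels.
     * Values outside the range are clipped (overload distortion). */
    double quantize_uniform(double x, double xmin, double xmax, int B)
    {
        int L = 1 << B;                      /* number of reconstruction levels */
        double delta = (xmax - xmin) / L;    /* step size */
        int i = (int)floor((x - xmin) / delta);
        if (i < 0)     i = 0;                /* clip below range */
        if (i > L - 1) i = L - 1;            /* clip above range */
        return xmin + (i + 0.5) * delta;     /* midpoint of cell i */
    }

    int main(void)
    {
        /* 4-bit quantizer on [-1, 1): step = 0.125, as in Example 2.2 */
        printf("%f\n", quantize_uniform(0.30, -1.0, 1.0, 4)); /* prints 0.312500 */
        return 0;
    }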
When designing or applying a uniform scalar quantizer, knowledge of the maximum value of the sequence is required. Typically, the range of the input signal (e.g., speech, audio, video) is expressed in terms of the standard deviation, σx, of the probability density function (pdf) of the signal's amplitudes. Specifically, it is often assumed that the range of input values is -4σx ≤ x[n] ≤ 4σx, where σx is the signal's standard deviation.

In addition to quantization, many algorithms depend on accurate yet simple mathematical models describing the statistics of signals. Several studies have been conducted on speech signals under the assumption that speech signal amplitudes are realizations of a random process. More recently, the accuracy of several pdf models was evaluated as a function of the duration of the speech segment used for capturing the speech statistics [9].
The following functions, also depicted in Figure 2.6, are evaluated as models of speech-signal pdfs:

Gamma distribution:

$$f(x) = \left(\frac{\sqrt{3}}{8\pi\sigma_x |x|}\right)^{1/2} \exp\!\left(-\frac{\sqrt{3}\,|x|}{2\sigma_x}\right), \quad -\infty < x < \infty$$

Laplacian distribution:

$$f(x) = \frac{1}{\sqrt{2}\,\sigma_x} \exp\!\left(-\frac{\sqrt{2}\,|x|}{\sigma_x}\right), \quad -\infty < x < \infty$$

Gaussian distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left(-\frac{x^2}{2\sigma_x^2}\right), \quad -\infty < x < \infty$$

where σx is the standard deviation.

Figure 2.6 Pdf models of speech sample distributions.
For speech signals, under the assumption that speech samples obey the Laplacian pdf, approximately 0.35% of the samples fall outside the range defined by ±4σx.
Example 2.3

Assume a B-bit binary codebook, thus having 2^B codewords or symbols, and let the maximum signal value be set to xmax = 4σx. What is the quantization step size of a uniform quantizer?

$$\Delta = \frac{2x_{max}}{2^B} = 2x_{max}2^{-B} = \frac{8\sigma_x}{2^B}$$
From the discussion presented thus far it is clear that the quality of a representation is related to the step size of the quantizer, Δ, which in turn depends on the number of bits B used to represent a signal value. The quality of quantization is typically expressed as a function of the step size Δ and relates directly to the notion of quantization noise.
2.4.1 Quantization Noise
There are two classes of quantization noise:

• Granular Distortion, and
• Overload Distortion
Granular Distortion
Granular distortion occurs for values of the unquantized signal x[n] that fall within the range of the quantizer, [xmin, xmax]. The quantization noise, e[n], is the error that occurs because the infinite-precision value x[n] is approximated by the finite-precision quantized representation x̂[n]. Specifically, the quantization error e[n] is defined as the difference between the quantized value x̂[n] and the true value x[n]:

$$e[n] = \hat{x}[n] - x[n]$$

For a given step size Δ, the magnitude of the quantization noise e[n] can be no greater than Δ/2, that is:

$$-\frac{\Delta}{2} \le e[n] \le \frac{\Delta}{2}$$
Example 2.4

For a periodic sine-wave signal, compare 3-bit and 8-bit quantizers. The input periodic signal is given by the following expression:

$$x[n] = \cos(\omega_0 n), \quad \omega_0 = 2\pi F_0 = 0.76 \cdot 2\pi$$

The MATLAB fix function is used to simulate quantization. The following figure depicts the result of the analysis.
Figure 2.7 Plot a) represents the sequence x[n] with infinite precision, b) the quantized version x̂[n], c) the quantization error e[n] for B=3 bits (L=8 quantization levels), and d) the quantization error for B=8 bits (L=256 quantization levels).
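A rough C analogue of the MATLAB fix-based simulation in Example 2.4, quantizing by truncation toward zero after scaling by 2^(B-1). This is a sketch, not the book's actual script:

    #include <math.h>
    #include <stdio.h>

    /* Quantize x in [-1, 1) to B bits by truncation toward zero,
     * mimicking fix(x * 2^(B-1)) / 2^(B-1). */
    double quantize_fix(double x, int B)
    {
        double scale = (double)(1 << (B - 1));
        return trunc(x * scale) / scale;
    }

    int main(void)
    {
        const double PI = 3.14159265358979323846;
        double emax3 = 0.0, emax8 = 0.0;
        for (int n = 0; n < 1000; n++) {
            double x = cos(2.0 * PI * 0.76 * n);
            double e3 = quantize_fix(x, 3) - x;   /* error, B = 3 */
            double e8 = quantize_fix(x, 8) - x;   /* error, B = 8 */
            if (fabs(e3) > emax3) emax3 = fabs(e3);
            if (fabs(e8) > emax8) emax8 = fabs(e8);
        }
        printf("max |e|: B=3 -> %f, B=8 -> %f\n", emax3, emax8);
        return 0;
    }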
Overload Distortion
Overload distortion occurs when samples fall outside the range covered by the quantizer. Those samples are typically clipped, incurring a quantization error in excess of Δ/2. Due to the small number of clipped samples, it is common to neglect these infrequent large errors in theoretical calculations.

Often the goal of signal processing in general, and of audio or image processing specifically, is to keep the bit rate as low as possible while maintaining a required level of quality. Meeting these two criteria requires fulfilling competing requirements.
Analysis of Quantization Noise
The desired approach in analyzing the quantization error in numerous applications is to assume that the quantization error is an ergodic white-noise random process; that is, the quantization error e[n] is uncorrelated with itself. In addition, it is assumed that the quantization noise and the input signal are uncorrelated, i.e., E(x[n]e[n+m]) = 0 for all m. The final assumption is that the pdf of the quantization noise is uniform over the quantization interval:

$$p_e(e) = \begin{cases} \dfrac{1}{\Delta}, & -\dfrac{\Delta}{2} \le e \le \dfrac{\Delta}{2} \\ 0, & \text{otherwise} \end{cases}$$

The stated assumptions are not always valid. Consider a slowly varying input signal x[n]: the quantization error e[n] then also changes gradually, and is thus signal-dependent, as depicted in Figure 2.8. Furthermore, correlated quantization noise can be perceptually annoying (e.g., in image sequences such as TV, or in audio).
Figure 2.8 Example of a slowly varying signal that causes the quantization error to be correlated. Plot a) represents the sequence x[n] with infinite precision, b) the quantized version x̂[n], c) the quantization error e[n] for B=3 bits (L=8 quantization levels), and d) the quantization error for B=8 bits (L=256 quantization levels). Note the reduction in correlation with the increase in the number of quantization levels, which implies a decrease of the step size Δ.
As illustrated in Figure 2.8, when the quantization step Δ is small, the assumptions that the noise is uncorrelated with itself and with the signal are roughly valid, particularly when the signal fluctuates rapidly among all quantization levels. In this case the quantization error approaches a white-noise process with an impulsive autocorrelation and a flat spectrum. The next figure demonstrates quantization effects on a speech signal (taken from the file TEST\DR3\FPKT0\si1538.wav of the TIMIT corpus).
Figure 2.9 Example of a speech signal demonstrating the effect of step size on the degree of correlation of the quantization error. Plot a) represents the sequence x[n] with infinite precision, b) the quantized version x̂[n], c) the quantization error e[n] for B=3 bits (L=8 quantization levels), which is clearly highly correlated with the original signal x[n], and d) the quantization error for B=8 bits (L=256 quantization levels). Note the reduction in correlation with the increase in the number of quantization levels, which implies a decrease of the step size Δ.
Figure 2.10 Histogram of quantization error for a speech signal. Plots depict the distribution of quantization errors with a) L = 2³, b) L = 2⁸, and c) L = 2¹⁶ quantization levels. Note the reduction of error magnitude as well as the increase in uniformity of the distribution as the number of quantization levels increases.

As depicted in Figure 2.10, with an increase in the number of quantization levels L, a decrease of correlation, marked by a flattening of the distribution toward uniform, can be observed.

An additional approach can be used to force e[n] to be white noise uncorrelated with x[n]: adding white noise to x[n] prior to quantization. The effect of this approach is demonstrated in Figure 2.11, obtained by adding an insignificant amount of Gaussian noise with zero mean and variance of 5 to the original signal. A dramatic improvement is clearly visible, particularly for L = 2¹⁶ quantization levels, by comparing the distributions with those in Figure 2.10.
Figure 2.11 Histogram of quantization error for the speech signal after adding Gaussian noise with zero mean and variance of 5. Plots depict the distribution of quantization errors with a) L = 2³, b) L = 2⁸, and c) L = 2¹⁶ quantization levels. Note the increase in uniformity of the distributions compared to the case where no noise was added.
The process of adding white noise is known as Dithering. This de-correlation technique has been shown to be useful in improving the perceptual quality of the quantization noise not only of speech signals but also of image signals.
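A minimal, self-contained sketch of dithered quantization using the same truncating quantizer as in the earlier sketch. The dither amplitude of half a quantization step is an assumption chosen for illustration:

    #include <math.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* B-bit truncating quantizer on [-1, 1), as in the earlier sketch. */
    static double quantize_fix(double x, int B)
    {
        double s = (double)(1 << (B - 1));
        return trunc(x * s) / s;
    }

    /* Add uniform dither of up to +/- half a step before quantizing;
     * this tends to de-correlate the error e[n] from the input x[n]. */
    double quantize_dithered(double x, int B)
    {
        double delta = 1.0 / (double)(1 << (B - 1));
        double dither = ((double)rand() / RAND_MAX - 0.5) * delta;
        return quantize_fix(x + dither, B);
    }

    int main(void)
    {
        printf("%f\n", quantize_dithered(0.3, 8));
        return 0;
    }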
2.4.2 Signal-to-Noise Ratio
A measure that quantifies the severity of the quantization noise is the signal-to-noise ratio (SNR). It relates the strength of the signal to the strength of the quantization noise, and is formally defined as:

$$\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{E\{x^2[n]\}}{E\{e^2[n]\}} \approx \frac{\dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1}x^2[n]}{\dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1}e^2[n]}$$
Given the following assumptions:

• Quantizer range: 2xmax,
• Quantization interval: Δ = 2xmax/2^B, for a B-bit quantizer, and
• Uniform pdf of the quantization error e[n],
it can be shown that (see Example 2.5):

$$\sigma_e^2 = \frac{\Delta^2}{12} = \frac{\left(2x_{max}2^{-B}\right)^2}{12} = \frac{x_{max}^2}{3 \cdot 2^{2B}}$$
Thus SNR can be expressed as:

$$\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3 \cdot 2^{2B}\,\sigma_x^2}{x_{max}^2} = 3 \cdot 2^{2B}\left(\frac{\sigma_x}{x_{max}}\right)^2$$
or in decibels (dB) as:

$$\mathrm{SNR(dB)} = 10\log_{10}\!\left(\frac{\sigma_x^2}{\sigma_e^2}\right) = 10\log_{10}3 + 20B\log_{10}2 - 20\log_{10}\frac{x_{max}}{\sigma_x} \approx 6B + 4.77 - 20\log_{10}\frac{x_{max}}{\sigma_x}$$

Assuming that the maximal value xmax, obtained from the pdf of the distribution of x[n], is set to xmax = 4σx, the SNR(dB) expression becomes:

$$\mathrm{SNR(dB)} \approx 6B - 7.2$$
Example 2.5

For a uniform quantizer with quantization interval Δ, derive the variance of the error signal. Consider that the error signal is random, with a uniform probability distribution over the interval defined by Δ, as shown in the figure below.

The mean and variance of p(e) are its first two moments, defined as expected values of powers m of the random variable e:

$$E\{e^m\} = \int_{-\infty}^{\infty} e^m\, p(e)\, de$$

Thus the mean and variance of p(e) are:
Figure 2.12 Uniform probability density function p(e) of the error signal in the range [-Δ/2, Δ/2].
$$m_e = \int_{-\Delta/2}^{\Delta/2} e\,\frac{1}{\Delta}\,de = 0$$

$$\sigma_e^2 = \int_{-\Delta/2}^{\Delta/2} e^2\,\frac{1}{\Delta}\,de = \frac{2}{\Delta}\int_{0}^{\Delta/2} e^2\,de = \frac{2}{\Delta}\cdot\frac{1}{3}\left(\frac{\Delta}{2}\right)^3 = \frac{\Delta^2}{12}$$
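A quick Monte-Carlo sanity check of the Δ²/12 result; the step size and sample count are arbitrary illustrative choices:

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        const double delta = 0.25;   /* arbitrary step size */
        const int N = 1000000;
        double sum2 = 0.0;
        for (int i = 0; i < N; i++) {
            /* e uniform on [-delta/2, delta/2] */
            double e = ((double)rand() / RAND_MAX - 0.5) * delta;
            sum2 += e * e;
        }
        printf("empirical var = %.8f, delta^2/12 = %.8f\n",
               sum2 / N, delta * delta / 12.0);
        return 0;
    }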
2.4.3 Transmission Rate
Another important factor in utilizing DSP processors is the bit rate R, defined as the number of bits per second streamed from the input into the DSP. The bit rate is computed with the following expression, where fs is the sample rate in Hz (samples per second) and B is the number of bits used to represent a sample:

$$R = B \cdot f_s$$

The quantization scheme presented so far is called pulse code modulation (PCM), in which B bits per sample are transmitted as a codeword.
Advantages of this scheme are:

• It is instantaneous (no coding delay), and
• It is independent of the signal content (voice, music, etc.).

Disadvantages:

• It requires a high bit rate for good quality.
Example 2.6

For "toll quality" (equivalent to typical telephone quality), a minimum of 11 bits per sample is required. For a 10,000 Hz sampling rate, the required bit rate is R = (11 bits/sample) × (10,000 samples/sec) = 110,000 bps = 110 kbps. For a CD-quality signal with a sample rate of 20,000 Hz and 16 bits/sample, SNR(dB) = 96 - 7.2 = 88.8 dB and the bit rate is 320 kbps.
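The arithmetic of Example 2.6, expressed as a small C sketch using the example's own rates and bit depths:

    #include <stdio.h>

    int main(void)
    {
        /* R = B * fs, in bits per second */
        int toll_bps = 11 * 10000;  /* 110,000 bps = 110 kbps */
        int cd_bps   = 16 * 20000;  /* 320,000 bps = 320 kbps */
        printf("toll: %d bps, CD: %d bps\n", toll_bps, cd_bps);
        printf("CD SNR ~ %.1f dB\n", 6.0 * 16 - 7.2); /* 6B - 7.2 = 88.8 dB */
        return 0;
    }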
Because the sampling rate is fixed for most applications, lowering the bit rate implies reducing the number of bits per sample. This area is of significant importance for communication systems and is known as Coding [1][5]. However, this coding refers to information-encoding procedures beyond the representation of numerical values being discussed here.

As indicated earlier, uniform quantization is optimal only if the distribution of the input samples x[n] is uniform. Thus, uniform quantization may not be optimal in general: the quantization error is not as small as it could be for a given number of decision and reconstruction levels. Consider, for example, a speech signal, for which x[n] is much more likely to be in one particular region than in another (low values occurring much more often than high values), as exemplified by Figure 2.6. This implies that the decision and reconstruction levels are not being utilized effectively when spaced uniformly over the range up to xmax. Clearly, an optimal solution must account for the distribution of the input samples.
2.4.4 Nonuniform Quantizer
A quantizer that is optimal (in a least-squared-error sense) for a particular pdf is referred to as the Max Quantizer. For a random variable x with a known pdf, the task is to find the set of L quantizer levels that minimizes the quantization error; that is, to find the decision and reconstruction levels x_i and x̂_i, respectively, that minimize the mean-squared-error (MSE) distortion measure:

$$D = E\left\{(x - \hat{x})^2\right\}$$
where E denotes the expected value and x̂ is the quantized version of x. It turns out that the optimal decision levels are given by the following expression:

$$x_k = \frac{\hat{x}_{k+1} + \hat{x}_k}{2}, \quad 1 \le k \le L-1$$

On the other hand, the optimal reconstruction level x̂_k is the centroid of p_x(x) over the interval x_{k-1} ≤ x ≤ x_k, computed by the following expression:

$$\hat{x}_k = \frac{\displaystyle\int_{x_{k-1}}^{x_k} x\,p_x(x)\,dx}{\displaystyle\int_{x_{k-1}}^{x_k} p_x(x)\,dx} = \int_{x_{k-1}}^{x_k} x\,\tilde{p}_x(x)\,dx$$

The above expression is interpreted as the mean value of x over the interval x_{k-1} ≤ x ≤ x_k for the normalized pdf p̃_x(x).
Solving the last two equations for x_k and x̂_k is a nonlinear problem in these two variables. There are iterative solutions, which require obtaining the pdf of x, an accurate estimate of which can be difficult [1][10].
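One classical iterative solution alternates the two optimality conditions above (the Lloyd-Max procedure). The C sketch below applies it to a Gaussian pdf evaluated on a discrete grid; the grid, the pdf, the number of levels, and the iteration count are all illustrative assumptions:

    #include <math.h>
    #include <stdio.h>

    #define L 4        /* quantizer levels */
    #define GRID 4000  /* pdf evaluation points on [-4, 4] */

    int main(void)
    {
        double lo = -4.0, hi = 4.0, dx = (hi - lo) / GRID;
        double xhat[L], xb[L + 1]; /* reconstruction and decision levels */

        /* start from a uniform quantizer */
        for (int i = 0; i < L; i++)
            xhat[i] = lo + (i + 0.5) * (hi - lo) / L;

        for (int it = 0; it < 100; it++) {
            /* decision levels: midpoints between reconstruction levels */
            xb[0] = lo; xb[L] = hi;
            for (int k = 1; k < L; k++)
                xb[k] = 0.5 * (xhat[k - 1] + xhat[k]);

            /* reconstruction levels: centroid of the pdf over each cell */
            for (int k = 0; k < L; k++) {
                double num = 0.0, den = 0.0;
                for (int g = 0; g < GRID; g++) {
                    double x = lo + (g + 0.5) * dx;
                    if (x >= xb[k] && x < xb[k + 1]) {
                        double p = exp(-0.5 * x * x); /* unnormalized Gaussian */
                        num += x * p; den += p;
                    }
                }
                if (den > 0.0) xhat[k] = num / den;
            }
        }
        for (int k = 0; k < L; k++) printf("xhat[%d] = %+.4f\n", k, xhat[k]);
        return 0;
    }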
2.4.5 Companding
The idea behind companding is based on the fact that a uniform quantizer is optimal for a uniform pdf: if a nonlinear transformation T is applied to the unquantized input x[n] to form a new sequence y[n] whose pdf is uniform, a uniform quantizer can then be applied to y[n] to obtain ŷ[n], as depicted in Figure 2.13.

A companding operation compresses the dynamic range of the input samples for encoding and expands the dynamic range upon decoding. Optimal application of the companding procedure requires accurate estimation of the pdf of the input values x[n], from which the nonlinear transformation T can be derived. In practice, however, such transformations are standardized in the CCITT international standard coder at 64 kbps: specifically, A-law and μ-law companding. A-law is used in Europe, while μ-law is used in North America.

The μ-law transformation is given by:

$$T(x[n]) = x_{max}\,\frac{\log\!\left(1 + \mu\,\dfrac{|x[n]|}{x_{max}}\right)}{\log(1 + \mu)}\,\mathrm{sign}(x[n])$$

The μ-law transformation with μ = 255, the North American PCM standard, followed by 8-bit uniform quantization (7 bits for the value and 1 bit for the sign), achieves "toll quality" speech in telephone channels. The achieved toll quality is equivalent to that of straight uniform quantization using 12 bits.

Due to standardization, digital telephone networks and voice modems use standard CODEC (COder-DECoder) chips in which audio is digitized in an 8-bit format.
[Figure: x[n] → Nonlinear Transformation T → y[n] → Uniform Quantizer Q[y] → Encoder → c[n] … c'[n] → Decoder → ŷ'[n] → Nonlinear Transformation T⁻¹ → x'[n].]

Figure 2.13 Block diagram of companding in the transmitting and receiving DSP system. x[n] is the unquantized input sample with a nonuniform pdf of values; y[n] is the value obtained after the nonlinear transformation, with a uniform pdf; ŷ[n] is the quantized sample; c[n] is the encoded binary representation of this sample value. The binary encoded stream is typically transmitted to a receiving system where it is converted back by decoding the input encoded samples c'[n] to ŷ'[n] and applying the inverse of the nonlinear transformation, T⁻¹, obtaining the sequence x'[n]. If c'[n] = c[n], x'[n] differs from x[n] by the amount of introduced quantization noise.
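A minimal C sketch of the μ-law compressor and its inverse with μ = 255. This implements only the transformation T above and its analytic inverse, not a full standardized codec:

    #include <math.h>

    #define MU 255.0

    /* Compress x in [-xmax, xmax] with the mu-law characteristic. */
    double mulaw_compress(double x, double xmax)
    {
        double s = (x < 0.0) ? -1.0 : 1.0;
        return s * xmax * log(1.0 + MU * fabs(x) / xmax) / log(1.0 + MU);
    }

    /* Expand: inverse of the compressor. */
    double mulaw_expand(double y, double xmax)
    {
        double s = (y < 0.0) ? -1.0 : 1.0;
        return s * (xmax / MU) * (pow(1.0 + MU, fabs(y) / xmax) - 1.0);
    }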
2.5 Data Representations
DSPs, similarly to general-purpose computer processors, support a number of data formats. The variety of data formats and computational operations determines DSP capabilities. The most general classification of DSP processors is in terms of their hardware support of data types for various operations (e.g., addition, subtraction, multiplication, and division). DSPs are thus categorized as fixed-point or floating-point devices. Fixed-point data types are computer representations of integer numbers; floating-point data types are computer representations of real numbers.
2.6 Fixed-Point Number Representations
In theory, the range of values that a number can take is unlimited; that is, an integer can take values ranging from -∞ to +∞:

…, -3, -2, -1, 0, 1, 2, 3, …

Due to limitations of the hardware, integer representations in a computer are restricted to a range that depends directly on the number of bits allocated for numbers. For example, if a processor uses 4 bits to represent a number, there are a total of 2⁴ = 16 possible distinct combinations. If those 4 bits are used to represent non-negative integers (the unsigned data type in conventional programming languages like C/C++, Java, Fortran, etc.), the range of values that can be represented is (0, …, 15). If positive as well as negative numbers are needed, half of the combinations are used to represent positive numbers and the remaining half negative numbers. It is necessary, therefore, to use one bit from the allocated set (typically the Most Significant Bit, or MSB; in this case bit number 3) to represent the sign of the number.

There are several binary conventions for representing signed and unsigned numbers. The most notable are:

1. Sign Magnitude
2. One's Complement
3. Two's Complement

An example of 4-bit signed numbers is presented in Table 2.1 below for the three formats listed above:
Decimal Value   Sign Magnitude   One's Complement   Two's Complement
+7              0111             0111               0111
+6              0110             0110               0110
+5              0101             0101               0101
+4              0100             0100               0100
+3              0011             0011               0011
+2              0010             0010               0010
+1              0001             0001               0001
+0              0000             0000               0000
-0              1000             1111               -
-1              1001             1110               1111
-2              1010             1101               1110
-3              1011             1100               1101
-4              1100             1011               1100
-5              1101             1010               1011
-6              1110             1001               1010
-7              1111             1000               1001
-8              -                -                  1000

Table 2.1 Example of 4-bit number representations.
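The three formats of Table 2.1 can be reproduced with a few bit operations in C. The encoding helpers below are illustrative, with results masked to 4 bits:

    #include <stdio.h>

    /* 4-bit encodings of v (v in -7..7, plus -8 for two's complement only),
     * following the three formats of Table 2.1. */
    int main(void)
    {
        int v = -5;
        unsigned mag  = (unsigned)(v < 0 ? -v : v);
        unsigned sm   = ((v < 0) ? 0x8u : 0x0u) | (mag & 0x7u); /* sign-magnitude   */
        unsigned ones = (v < 0) ? (~mag & 0xFu) : mag;          /* one's complement */
        unsigned twos = (unsigned)v & 0xFu;                     /* two's complement */
        printf("%d: sm=0x%X ones=0x%X twos=0x%X\n", v, sm, ones, twos);
        return 0;                     /* -5: sm=0xD ones=0xA twos=0xB */
    }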
2.6.1 Sign-Magnitude Format
As depicted in Table 2.1, signed integers (positive and negative values) in this format use the MSB to represent the sign of the number, and the remaining bits represent its magnitude. A 16-bit sign-magnitude format representation is depicted in Figure 2.14 below.
[Figure: 16-bit signed word; bit 15 is the sign bit, bits 14 … 0 hold the magnitude with position values 2^14 … 2^0, and the radix point lies to the right of bit 0.]

Figure 2.14 Sign-Magnitude Format.
This format thus has two possible representations of 0, one with a positive and one with a negative sign, as depicted in Table 2.1. This poses additional complications in designing the hardware to carry out operations; the issue is discussed further in the following sections.

With 4 bits the range of values covers the interval [-7, +7]. With 16 bits the range is the interval [-32767, +32767]. In general, with n bits in sign-magnitude format only the integers in the range from -(2^(n-1) - 1) to +(2^(n-1) - 1) can be represented.

This format has two drawbacks. The first, already mentioned, is that it has two different representations of 0. The second is that it requires two different rules, one for addition and one for subtraction, along with a way to compare magnitudes to determine their relative values prior to applying subtraction. This in turn requires more complex hardware to carry out those rules.
2.6.2 One’s-Complement Format
As depicted in Table 2.1, negative values -x are obtained by negating, or complementing, each bit of the binary representation of the positive integer x. For an n-bit binary representation of a number x, the following is its one's complement:

$$x \,\hat{=}\, b_{n-1}\cdots b_2 b_1 b_0 \quad (n\text{-bit representation of the number } x)$$

$$\bar{x} \,\hat{=}\, \bar{b}_{n-1}\cdots \bar{b}_2 \bar{b}_1 \bar{b}_0 \quad (n\text{-bit representation of the one's complement of } x)$$

where b̄_i is the complement of bit b_i. Clearly, the following holds:

$$x + \bar{x} = 11\cdots1 = 2^n - 1 \tag{2-1}$$
Similarly to the sign-magnitude format, the MSB is used to represent the sign of a number. A positive number will have MSB value "0", which after the complement operation becomes "1", indicating a negative integer number. The remaining n-1 bits represent the number itself if it is positive; otherwise they hold its one's complement. Applying equation 2-1, the following expression defines the one's-complement representation format:
$$x_{(1)} \,\hat{=}\, \begin{cases} x, & x \ge 0 \\ \bar{x} = 2^n - 1 - |x|, & x < 0 \end{cases} \tag{2-2}$$
Similarly to the sign-magnitude representation, with 4 bits we can represent integers in the range defined by the interval [-7, +7], as depicted in Table 2.1. In general, with n bits the one's-complement format can represent the integers in the range from -(2^(n-1) - 1) to +(2^(n-1) - 1).
The one's-complement format is superior to the sign-magnitude format in that addition and subtraction require only one rule, namely that of addition, since subtraction can be carried out by performing addition on the one's-complemented number, as depicted below by applying equation 2-2:

$$z_{(1)} = x_{(1)} - y_{(1)} = x_{(1)} + \left(-|y_{(1)}|\right) = x_{(1)} + \left(2^n - 1 - y_{(1)}\right)$$
It turns out that addition of one's-complement numbers is somewhat complicated to implement in hardware: an additional carry out of the most significant position must be added back at the least significant bit (2⁰) to manage overflow. This problem is alleviated by the two's-complement representation, discussed next.
2.6.3 Two’s-Complement Format
The n-bit two's complement x̃ of a positive integer x is defined by the following expression:

$$\tilde{x} = \bar{x} + 1 = 2^n - x, \qquad x + \tilde{x} = 2^n \tag{2-3}$$

As depicted in Table 2.1, the disadvantage of having two representations for the number zero is eliminated. As before, the MSB is used to represent the sign of the number. Using equation 2-3, the two's-complement representation format is given by:
$$x_{(2)} \,\hat{=}\, \begin{cases} x, & x \ge 0 \\ \tilde{x} = 2^n - |x|, & x < 0 \end{cases} \tag{2-4}$$
With the two's-complement representation, obtained by incrementing the one's-complement representation by one, the problem of two zeros is alleviated. Consequently, the range of negative numbers is increased by one compared to the previous representations, as depicted in Table 2.1. With 4 bits the range of integer values is defined by the interval [-8, +7], also as depicted in Table 2.1. In general, with n bits the two's-complement format can represent the integers in the range from -2^(n-1) to +(2^(n-1) - 1).

The following lists the advantages of the two's-complement format:

1. It is compatible with the notion of negation; that is, the complement of the complement is the number itself.
2. It unifies the subtraction and addition operations, since subtractions are essentially additions of the two's-complement representation of a number.
3. For summation of more than two numbers, internal overflows do not affect the final result so long as the final result is within range; adding two positive numbers yields a positive result, and adding two negative numbers yields a negative result.

Due to these properties, two's complement is the preferred format for representing negative integer numbers. Consequently, almost all current processors, including DSPs, implement signed arithmetic using this format and provide special functions to support it.
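A small C illustration of how two's complement unifies subtraction with addition: x - y is computed as x + (~y + 1), with all arithmetic masked to a 4-bit word to mimic the hardware width (values taken from Table 2.1):

    #include <stdio.h>

    /* 4-bit two's-complement subtraction implemented as addition:
     * x - y == x + (~y + 1), all arithmetic masked to 4 bits. */
    int main(void)
    {
        unsigned x = 0x3;  /* +3 */
        unsigned y = 0x5;  /* +5 */
        unsigned diff = (x + ((~y + 1u) & 0xFu)) & 0xFu;
        /* 0xE = 1110 is -2 in 4-bit two's complement */
        printf("3 - 5 -> 0x%X (%d)\n", diff,
               (diff & 0x8u) ? (int)diff - 16 : (int)diff);
        return 0;
    }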
2.7 Fixed-Point DSP's
Fixed-point DSP hardware supports only fixed-point data types. Such hardware is thus more restrictive, performing basic operations only on fixed-point data types. With software emulation, fixed-point DSPs can execute floating-point operations; however, floating-point operations then come at the expense of performance, due to the lack of floating-point hardware.

Lower-end fixed-point DSPs are 16-bit architectures; that is, the processor's word length is 16 bits and its basic operations use 16-bit data types. Typically, 16-bit DSPs also support a double-precision, 2×16 = 32-bit, data type. This extended support may come at the expense of processor performance, depending on the hardware architecture and design. The 16-bit signed and unsigned data type formats are shown in Figure 2.15 below.

There are a number of possible fixed-point representations that DSP hardware may support; one example was presented in Table 2.1. For 16-bit representations the ranges of numbers are given in Table 2.2. The Analog Devices BF533 family architecture supports two's-complement integer formats.
Clearly, the range of values (commonly referred to in the literature as the dynamic range) is proportional to the number of bits used to represent a number. If the result of an operation exceeds the precision of the data type, in the worst case the resulting number will overflow and wrap around, generating a large error. At best, if the condition is handled (by hardware, setting the overflow flag in the processor's arithmetic-logic-unit status register, or by software, checking the input values before the operation to detect potential overflow), the result will be saturated to the maximal/minimal value of the corresponding data type, leading to a truncation error. Since the most common DSP operations are multiply-and-accumulate operations, representations in which the magnitude of the number is mapped directly in the processor require special handling to avoid truncation effects. In addition, exceeding the precision provided by the dynamic range of the data type typically introduces nonlinear effects, producing large errors, and sometimes breaks the algorithm.
[Figure: two 16-bit word layouts. Unsigned: bit positions 15 … 0 with position values 2^15 … 2^0 and the radix point to the right of bit 0. Signed: bit 15 is the sign bit with position value -2^15, and bits 14 … 0 have position values 2^14 … 2^0, with the radix point to the right of bit 0.]

Figure 2.15 The 16-bit unsigned and signed data type representations.
Unsigned Fixed-Point Numbers

                   Binary (sign formats not applicable)
16-bit MIN VALUE   0
16-bit MAX VALUE   2^16 - 1 = 65535
32-bit MIN VALUE   0
32-bit MAX VALUE   2^32 - 1 = 4294967295

Signed Fixed-Point Numbers

                   Sign Magnitude               One's Complement             Two's Complement
16-bit MIN VALUE   -(2^15 - 1) = -32767         -(2^15 - 1) = -32767         -2^15 = -32768
16-bit MAX VALUE   +(2^15 - 1) = +32767         +(2^15 - 1) = +32767         +(2^15 - 1) = +32767
32-bit MIN VALUE   -(2^31 - 1) = -2147483647    -(2^31 - 1) = -2147483647    -2^31 = -2147483648
32-bit MAX VALUE   +(2^31 - 1) = +2147483647    +(2^31 - 1) = +2147483647    +(2^31 - 1) = +2147483647

Table 2.2 Range of values represented by a 16-bit DSP (single and double precision).
2.8 Fixed-Point Representations Based on Radix-Point
One way to view the possible fixed-point representations supported by a processor is based on the implied position of the radix point. In the integer fixed-point representations discussed previously, zero bits were used after the radix point. This implies the following representation, depicting the integer format in a DSP:

Implicit Position of Radix Point

$$x = \pm\left(d_n B^n + \cdots + d_2 B^2 + d_1 B^1 + d_0 B^0\,.\right)$$
Example 2.7

A DSP uses 4 bits to represent all input fixed-point integer numbers. Two's-complement format is used for negative numbers. The tables below indicate the resulting number when the operations are carried out using 4 bits.
Operand 1            Operation   Operand 2            Resulting Number     Comment
4-bit Binary (Dec)               4-bit Binary (Dec)   4-bit Binary (Dec)
0011 (+3)            +           0010 (+2)            0101 (+5)            No Overflow
0110 (+6)            +           0101 (+5)            1011 (-5)            Overflow
1101 (-3)            +           0111 (+7)            0100 (+4)            No Overflow
1010 (-6)            +           1000 (-8)            0010 (+2)            Overflow
In the cases of overflow it becomes necessary to handle the result in order to minimize the resulting error. In the table above the overflow errors are: 6+5 = 11 compared to the erroneous result of -5, and -6-8 = -14 vs. +2; both errors have magnitude 16. Overflow can be avoided by doubling the precision of the resulting number, that is, by using 8 bits, as depicted in the table below.
Operand 1            Operation   Operand 2            Resulting Number
4-bit Binary (Dec)               4-bit Binary (Dec)   8-bit Binary (Dec)
0011 (+3)            +           0010 (+2)            0000 0101 (+5)
0110 (+6)            +           0101 (+5)            0000 1011 (+11)
1101 (-3)            +           0111 (+7)            0000 0100 (+4)
1010 (-6)            +           1000 (-8)            1111 0010 (-14)
The next table depicts the results of multiplying two 4-bit numbers into a 4-bit resulting number. The errors due to overflow are large. For example, (+7)×(+6) = (+42), but the resulting number is +2, which introduces an error of 40. Similarly, (-6)×(-5) = (+30), with the resulting number (-2) and an error of 32.

Operand 1            Operation   Operand 2            Resulting Number     Comment
4-bit Binary (Dec)               4-bit Binary (Dec)   4-bit Binary (Dec)
0010 (+2)            x           0011 (+3)            0110 (+6)            No Overflow
0111 (+7)            x           0110 (+6)            0010 (+2)            Overflow
1101 (-3)            x           0010 (+2)            1010 (-6)            No Overflow
1010 (-6)            x           1011 (-5)            1110 (-2)            Overflow
As with addition, overflow can be avoided in multiplication if the precision of the result is doubled compared to the input operands. The table below demonstrates this case.

Operand 1            Operation   Operand 2            Resulting Number
4-bit Binary (Dec)               4-bit Binary (Dec)   8-bit Binary (Dec)
0010 (+2)            x           0011 (+3)            0000 0110 (+6)
0111 (+7)            x           0110 (+6)            0010 1010 (+42)
1101 (-3)            x           0010 (+2)            1111 1010 (-6)
1010 (-6)            x           1011 (-5)            0001 1110 (+30)
In all but trivial algorithms it is not practical to keep widening the precision of the resulting numbers. Such algorithms require a number of iterations involving intermediate data derived from the input data to generate the resulting output. In order to achieve error-free operation, each intermediate output would have to have double the number of bits of its inputs; iterative application of this approach quickly exceeds the hardware capabilities of the DSP (e.g., a 32-bit int data type). However, there are several techniques that enable DSP developers to bound the errors within the margins tolerated by the application. This issue is discussed further in the next chapter.
There are alternative representations that have better properties than the common integer formats presented in the preceding sections. One such representation requires all numbers to be scaled within the interval [-1, 1). This is an all-fractional representation using a fixed-point architecture. Note that this representation is not to be confused with fractional numbers in floating-point representations, which use a different format and different hardware rules to perform floating-point operations.

In the all-fractional representation, the allocated bits are used to cover the fixed dynamic range between -1 and 1. Clearly, the larger the number of bits used for fractional numbers, the finer the representation (finer granularity). This stands in contrast to the previous magnitude representation, where the granularity is fixed and equal to 1, a constant difference between any two consecutive numbers. Imposing a constant range may potentially be considered a drawback of the fractional representation, since it may require keeping track of the scaling factor used to translate the original range of values to the fixed [-1, 1) range. On the other hand, this representation has much better properties in terms of truncation error as well as overflow.

Truncation error and overflow require special consideration in the fixed-point integer representation discussed earlier. The 16-bit fractional representation, on the other hand, does not require overflow handling in multiplication: the product of two fractional numbers with values between -1 and 1 is a number in the same range. The only consideration with the 16-bit fractional representation is underflow, which typically does not require special handling. Underflow incurs an error when the result of an operation is smaller than the granularity of the representation. If this error cannot be tolerated, additional rescaling of the intermediate data is required; otherwise the effect of the error falls below the granularity of the representation, i.e., the smallest representable number.
A fractional fixed-point representation assumes the radix point to be in the left-most position, implying that all bits have positional values less than one. The general notation of this fractional representation is:

Implicit Position of Radix Point

$$x = \pm\left(.\,d_{-1}B^{-1} + d_{-2}B^{-2} + \cdots + d_{-(m+1)}B^{-(m+1)}\right)$$

This representation is also depicted in Figure 2.16.

[Figure: sign bit s followed by the fractional bits, with the radix point immediately after the sign bit.]

Figure 2.16 Fractional Fixed-Point Representation.
The integer and fractional fixed-point representations presented so far depict two possible number representation schemes utilizing the two extreme positions of the implied radix point. Since the radix-point position defines the notation, a formal definition of such a representational scheme is based precisely on it. Let N be the total number of bits used to represent a number. Also, let p denote the number of bits to the left of the radix point, specifying the integer portion of the number, and q the number of bits to the right of the radix point, specifying the fractional portion. The notation Qp.q specifies the format of the representation, the position of the implied radix point, and the precision of the representation. For example, the unsigned 16-bit integer fixed-point format is expressed as Q16.0, since all bits lie to the left of the radix point. Consequently, the signed 16-bit integer fixed-point format is denoted Q15.0, with 1 bit used to represent the sign of the number. The all-fractional representation uses the Q0.16 and Q0.15 formats for unsigned and signed numbers, respectively.
In general, for unsigned numbers the relationship between the total number of bits N and p, q is:

N = p + q    (unsigned)

For signed numbers the following relationship holds:

N = p + q + 1    (signed)
In light of the introduced notation, a number represented in a signed binary Qp.q format has a value that can be computed by the following expression:

$$\mathrm{Num} = \left(-b_{N-1}2^{N-1} + b_{N-2}2^{N-2} + b_{N-3}2^{N-3} + \cdots + b_{0}2^{0}\right)2^{-q} = -b_{N-1}2^{p} + \sum_{k=0}^{N-2} b_k\,2^{k-q}$$

For unsigned numbers the following expression can be used:

$$\mathrm{Num} = \left(b_{N-1}2^{N-1} + b_{N-2}2^{N-2} + b_{N-3}2^{N-3} + \cdots + b_{0}2^{0}\right)2^{-q} = \sum_{k=0}^{N-1} b_k\,2^{k-q}$$
2.8.1 Dynamic Range
Now a formal definition of the dynamic range of a data representation can be stated. The dynamic range, given on a dB scale, is defined as the ratio of the largest number (Max) to the smallest positive number greater than zero (Min) of a data representation. It is computed by the following expression:

$$DR\,[\mathrm{dB}] = 20 \log_{10}\left(\frac{\mathrm{Max}}{\mathrm{Min}}\right)$$
The dynamic range and precision of signed and unsigned integer and fractional 16-bit representations are given in the following table:

| Representation | Range | Dynamic Range [dB] | Precision |
|---|---|---|---|
| Unsigned Integer (16-bit) | [0, 65535] | 20 log10((2^16 - 1)/2^0) ≈ 96 dB | 1 |
| Signed Integer (16-bit) | [-32768, 32767] | 20 log10((2^15 - 1)/2^0) ≈ 90 dB | 1 |
| Unsigned Fractional (16-bit) | [0, 0.9999847412109375] | 20 log10((1 - 2^-16)/2^-16) ≈ 96 dB | 2^-16 |
| Signed Fractional (16-bit) | [-1, 0.999969482421875] | 20 log10((1 - 2^-15)/2^-15) ≈ 90 dB | 2^-15 |

Table 5-1. Dynamic Range and Precision of 16-bit signed and unsigned integer and fractional representations.
2.8.2 Precision
Earlier, we introduced the concept of the granularity of a representation. Here it is formally defined as the precision of a representation: the difference between two consecutive representable numbers. Note that this difference is the smallest step between any two representations.
In Table 5-2 below, the largest and smallest positive and negative values, as well as the corresponding precisions for fractional and integer 16-bit data types, are given.
Unsigned Fractional and Integer 16-bit Representations

| Format | Integer Bits | Fractional Bits | Largest Value | Smallest Value | Precision |
|---|---|---|---|---|---|
| Q0.16 | 0 | 16 | 0.9999847412109375 | 0.0 | 0.0000152587890625 |
| Q16.0 | 16 | 0 | 65535.0 | 0 | 1.0 |

Signed Fractional and Integer 16-bit Representations

| Format | Integer Bits | Fractional Bits | Largest Positive Value | Least Negative Value | Precision |
|---|---|---|---|---|---|
| Q0.15 | 0 | 15 | 0.999969482421875 | -1.0 | 0.000030517578125 |
| Q15.0 | 15 | 0 | 32767.0 | -32768 | 1.0 |

Table 5-2. Maximal, Minimal, and Precision values for Integer and Fractional 16-bit fixed-point representations.
In addition to the 16-bit signed and unsigned representations already discussed, namely the Q16.0 and Q15.0 integer and the Q0.16 and Q0.15 fractional formats, there is a whole range of in-between representations that combine integer and fractional parts.
The different Qp.q formats are depicted in Figure 5-2. The table that follows uses the format definitions in the figure to present the ranges of 16-bit signed numbers that can be represented in a DSP.
[Figure 5-2. Possible Qp.q format representations of 16-bit data length. Each row of the figure shows one signed format (Q0.15, Q1.14, Q2.13, Q3.12, Q4.11, Q5.10, ..., Q14.1, Q15.0) across bit positions 15 down to 0: bit 15 is the sign bit with weight -2^p, the remaining bits carry weights from 2^(p-1) down to 2^(-q), and the implied radix point moves one position to the right with each successive format.]
Table 5-3 below summarizes the properties of all possible 16-bit signed representations in terms of the number of integer bits (p), the number of fractional bits (q), the largest positive value, the least negative value, and the precision for each Q format.
Signed 16-bit Representations

| Format | Integer Bits | Fractional Bits | Largest Positive Value in Decimal (0x7FFF) | Least Negative Value in Decimal (0x8000) | Precision |
|---|---|---|---|---|---|
| Q0.15 | 0 | 15 | 0.999969482421875 | -1.0 | 0.000030517578125 |
| Q1.14 | 1 | 14 | 1.999938964843750 | -2.0 | 0.000061035156250 |
| Q2.13 | 2 | 13 | 3.999877929687500 | -4.0 | 0.000122070312500 |
| Q3.12 | 3 | 12 | 7.999755859375000 | -8.0 | 0.000244140625000 |
| Q4.11 | 4 | 11 | 15.999511718750000 | -16.0 | 0.000488281250000 |
| Q5.10 | 5 | 10 | 31.999023437500000 | -32.0 | 0.000976562500000 |
| Q6.9 | 6 | 9 | 63.998046875000000 | -64.0 | 0.001953125000000 |
| Q7.8 | 7 | 8 | 127.996093750000000 | -128.0 | 0.003906250000000 |
| Q8.7 | 8 | 7 | 255.992187500000000 | -256.0 | 0.007812500000000 |
| Q9.6 | 9 | 6 | 511.984375000000000 | -512.0 | 0.015625000000000 |
| Q10.5 | 10 | 5 | 1023.968750000000000 | -1024.0 | 0.031250000000000 |
| Q11.4 | 11 | 4 | 2047.937500000000000 | -2048.0 | 0.062500000000000 |
| Q12.3 | 12 | 3 | 4095.875000000000000 | -4096.0 | 0.125000000000000 |
| Q13.2 | 13 | 2 | 8191.750000000000000 | -8192.0 | 0.250000000000000 |
| Q14.1 | 14 | 1 | 16383.500000000000000 | -16384.0 | 0.500000000000000 |
| Q15.0 | 15 | 0 | 32767.000000000000000 | -32768.0 | 1.000000000000000 |

Table 5-3. Maximal, minimal, and precision values for all signed 16-bit Qp.q fixed-point representations.
Chapter 3

Implementation Considerations
To properly implement a design on specific DSP hardware, familiarity with the specifics of the related development tools is crucial.
Most DSP processors use two's-complement fractional number representations in different Q formats. The native formats for the Blackfin DSP family are the signed fractional Q1.(N-1) and unsigned fractional Q0.N formats, where N is the number of bits in the data word. Depending on the compiler settings, the considerations described in the following sections apply.
3.1 Assembly
Since the assembler recognizes only integer values, the programmer must keep track of the position of the binary point when manipulating fractional numbers [11]. The following steps can be used to convert a fractional number in Q format into an integer value that the assembler can recognize (a worked example follows the list):
1. Normalize the fractional number to the range determined by the desired Q format.
2. Multiply the normalized fractional number by 2^n, where n is the number of fractional bits.
3. Round the product to the nearest integer.
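For example, to express 0.75 in Q0.15 format for the assembler: the value already lies in [-1, 1), so step 1 leaves it unchanged; step 2 gives 0.75 x 2^15 = 24576; and step 3 requires no rounding, so the integer written in the assembly source is 24576 (0x6000).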
3.2 C – Language Support for Fractional Data Types

The C/C++ run-time environment of the Blackfin DSP processor family uses the intrinsic C/C++ data types and data formats listed in Table 5-4 below.
| Type | Size in Bits | Data Representation | sizeof Return in Bytes |
|---|---|---|---|
| char | 8 bits signed | 8-bit two's complement | 1 |
| unsigned char | 8 bits unsigned | 8-bit unsigned magnitude | 1 |
| short | 16 bits signed | 16-bit two's complement | 2 |
| unsigned short | 16 bits unsigned | 16-bit unsigned magnitude | 2 |
| int | 32 bits signed | 32-bit two's complement | 4 |
| unsigned int | 32 bits unsigned | 32-bit unsigned magnitude | 4 |
| long | 32 bits signed | 32-bit two's complement | 4 |
| unsigned long | 32 bits unsigned | 32-bit unsigned magnitude | 4 |
| long long | 64 bits signed | 64-bit two's complement | 8 |
| unsigned long long | 64 bits unsigned | 64-bit unsigned magnitude | 8 |
| pointer | 32 bits | 32-bit two's complement | 4 |
| function pointer | 32 bits | 32-bit two's complement | 4 |
| float | 32 bits | 32-bit IEEE single-precision | 4 |
| double | 64 bits | 64-bit IEEE double-precision | 8 |
| long double | 64 bits | 64-bit IEEE | 8 |
| fract16 | 16 bits signed | Q1.15 fraction format | 2 |
| fract32 | 32 bits signed | Q1.31 fraction format | 4 |

Table 5-4. Data types supported by the Blackfin DSP processor and the VisualDSP++ integrated development environment.
It is important to note that the floating-point and 64-bit data types are implemented using software emulation; they must therefore be expected to run more slowly than hardware-supported native data types. The emulated data types are float, double, long double, long long, and unsigned long long.
The fract16 and fract32 types are not actually intrinsic data types; they are typedefs to short and long, respectively.
In C, built-in functions must be used to perform basic arithmetic operations (see "Fractional Value Built-In Functions in C++" in [12]). The expression fract16*fract16 in a C program will not produce a correct result. This is a consequence of a limitation of the C programming language, which does not support operator overloading: the "*" operator in fract16*fract16 invokes standard multiplication on short operands (recall that fract16 is not an intrinsic data type but a typedef to short).
Because fractional arithmetic uses slightly different instructions from normal arithmetic, one cannot use the standard C operators on fract data types and get the right result; the built-in functions described here must be used to work with fractional data.
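As a minimal sketch of the distinction (assuming the VisualDSP++ environment and the fract.h built-ins listed in the tables that follow):

```c
#include <fract.h>

fract16 scale_sample(fract16 a, fract16 b)
{
    /* fract16 wrong = a * b;  -- plain "*" multiplies the underlying
       shorts and never shifts the product back into Q0.15 format. */
    return mult_fr1x16(a, b);  /* correct fractional multiply,
                                  truncated to 16 bits */
}
```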
The fract.h header file provides access to the definitions of each of the built-in functions that support fractional values. These functions have names with the following suffixes:
- _fr1x16 for single fract16,
- _fr2x16 for dual fract16, and
- _fr1x32 for single fract32.
All the functions in fract.h are marked as inline, so when compiling with the compiler optimizer enabled, the built-in functions are inlined.
The list of built-in functions for the fractional 16-bit data type, fract16, with brief descriptions, is given in Table 5-5 below:

| Built-in function (fract16 operands) | Description |
|---|---|
| fract16 add_fr1x16(fract16 f1, fract16 f2) | Performs 16-bit addition of the two input parameters (f1+f2). |
| fract16 sub_fr1x16(fract16 f1, fract16 f2) | Performs 16-bit subtraction of the two input parameters (f1-f2). |
| fract16 mult_fr1x16(fract16 f1, fract16 f2) | Performs 16-bit multiplication of the input parameters (f1*f2). The result is truncated to 16 bits. |
| fract16 multr_fr1x16(fract16 f1, fract16 f2) | Performs a 16-bit fractional multiplication (f1*f2) of the two input parameters. The result is rounded to 16 bits. Whether the rounding is biased or unbiased depends on the RND_MOD bit in the ASTAT register. |
| fract32 mult_fr1x32(fract16 f1, fract16 f2) | Performs a fractional multiplication on two 16-bit fractions, returning the 32-bit result. |
| fract16 abs_fr1x16(fract16 f1) | Returns the 16-bit absolute value of the input parameter. When the input is 0x8000 (two's-complement representation of the largest negative number), saturation occurs and 0x7fff is returned. |
| fract16 min_fr1x16(fract16 f1, fract16 f2) | Returns the minimum of the two input parameters. |
| fract16 max_fr1x16(fract16 f1, fract16 f2) | Returns the maximum of the two input parameters. |
| fract16 negate_fr1x16(fract16 f1) | Returns the 16-bit negation of the input parameter (-f1). If the input is 0x8000, saturation occurs and 0x7fff is returned. |
| fract16 shl_fr1x16(fract16 src, short shft) | Arithmetically shifts src left by shft places. The empty bits are zero-filled. If shft is negative, the shift is to the right by abs(shft) places with sign extension. |
| fract16 shl_fr1x16_clip(fract16 src, short shft) | As shl_fr1x16, with shft clipped to 5 bits. |
| fract16 shr_fr1x16(fract16 src, short shft) | Arithmetically shifts src right by shft places with sign extension. If shft is negative, the shift is to the left by abs(shft) places, and the empty bits are zero-filled. |
| fract16 shr_fr1x16_clip(fract16 src, short shft) | As shr_fr1x16, with shft clipped to 5 bits. |
| fract16 shrl_fr1x16(fract16 src, short shft) | Logically shifts a fract16 right by shft places. There is no sign extension and no saturation; the empty bits are zero-filled. |
| fract16 shrl_fr1x16_clip(fract16 src, short shft) | As shrl_fr1x16, with shft clipped to 5 bits. |
| int norm_fr1x16(fract16 f1) | Returns the number of left shifts required to normalize the input so that it lies in the interval 0x4000 to 0x7fff, or 0x8000 to 0xc000. In other words, shl_fr1x16(x, norm_fr1x16(x)) returns a value in one of those ranges. |

Table 5-5. Built-in functions for 16-bit fractional (fract16) operands.
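As a usage sketch built only from the Table 5-5 functions (assuming, per the Blackfin documentation, that add_fr1x16 saturates on overflow rather than wrapping):

```c
#include <fract.h>

/* Q0.15 dot product using only fract16 built-ins: each product is
   truncated to 16 bits by mult_fr1x16 and accumulated with the
   saturating 16-bit addition add_fr1x16. */
fract16 dot_fr1x16(const fract16 *x, const fract16 *y, int n)
{
    fract16 acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc = add_fr1x16(acc, mult_fr1x16(x[i], y[i]));
    return acc;
}
```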
The list of built-in functions for the fractional 32-bit data type, fract32, with brief descriptions, is given in Table 5-6 below:

| Built-in function (fract32 operands) | Description |
|---|---|
| fract32 add_fr1x32(fract32 f1, fract32 f2) | Performs 32-bit addition of the two input parameters (f1+f2). |
| fract32 sub_fr1x32(fract32 f1, fract32 f2) | Performs 32-bit subtraction of the two input parameters (f1-f2). |
| fract32 mult_fr1x32x32(fract32 f1, fract32 f2) | Performs 32-bit multiplication of the input parameters (f1*f2). The result (calculated internally with an accuracy of 40 bits) is rounded (biased rounding) to 32 bits. |
| fract32 mult_fr1x32x32NS(fract32 f1, fract32 f2) | Performs 32-bit non-saturating multiplication of the input parameters (f1*f2); somewhat faster than mult_fr1x32x32. The result (calculated internally with an accuracy of 40 bits) is rounded (biased rounding) to 32 bits. |
| fract32 abs_fr1x32(fract32 f1) | Returns the 32-bit absolute value of the input parameter. When the input is 0x80000000 (two's-complement representation of the largest negative number), saturation occurs and 0x7fffffff is returned. |
| fract32 min_fr1x32(fract32 f1, fract32 f2) | Returns the minimum of the two input parameters. |
| fract32 max_fr1x32(fract32 f1, fract32 f2) | Returns the maximum of the two input parameters. |
| fract32 negate_fr1x32(fract32 f1) | Returns the 32-bit negation of the input parameter (-f1). If the input is 0x80000000, saturation occurs and 0x7fffffff is returned. |
| fract32 shl_fr1x32(fract32 src, short shft) | Arithmetically shifts src left by shft places. The empty bits are zero-filled. If shft is negative, the shift is to the right by abs(shft) places with sign extension. |
| fract32 shl_fr1x32_clip(fract32 src, short shft) | As shl_fr1x32, with shft clipped to 5 bits. |
| fract32 shr_fr1x32(fract32 src, short shft) | Arithmetically shifts src right by shft places with sign extension. If shft is negative, the shift is to the left by abs(shft) places, and the empty bits are zero-filled. |
| fract32 shr_fr1x32_clip(fract32 src, short shft) | As shr_fr1x32, with shft clipped to 5 bits. |
| fract16 sat_fr1x32(fract32 f1) | If f1 > 0x00007fff (2^15 - 1), returns 0x7fff; if f1 < 0xffff8000 (-2^15), returns 0x8000; otherwise returns the lower 16 bits of f1. |
| fract16 round_fr1x32_clip(fract32 f1) | Rounds the 32-bit fract to a 16-bit fract using biased rounding. |
| int norm_fr1x32(fract32 f1) | Returns the number of left shifts required to normalize the input so that it lies in the interval 0x40000000 to 0x7fffffff (positive), or 0x80000000 to 0xc0000000 (negative). In other words, shl_fr1x32(x, norm_fr1x32(x)) returns a value in one of those ranges. |
| fract16 trunc_fr1x32(fract32 f1) | Returns the top 16 bits of f1; i.e., truncates f1 to 16 bits. |

Table 5-6. Built-in functions for 32-bit fractional (fract32) operands.
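A combined usage sketch (a hypothetical helper; it assumes the built-ins behave exactly as described in Tables 5-5 and 5-6): mult_fr1x32 widens each 16 x 16-bit product to fract32 so no precision is lost per term, add_fr1x32 accumulates, and round_fr1x32_clip rounds the result back to fract16.

```c
#include <fract.h>

/* Energy of a Q0.15 signal computed in 32-bit fractional precision. */
fract16 energy_fr16(const fract16 *x, int n)
{
    fract32 acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc = add_fr1x32(acc, mult_fr1x32(x[i], x[i]));
    return round_fr1x32_clip(acc);  /* biased rounding back to Q0.15 */
}
```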
VisualDSP++ also provides European Telecommunications Standards Institute (ETSI) support routines. For further information, consult the ADSP-BF53x/BF56x Blackfin Processor Programming Reference manual.
3.3 C++ – Language Support for Fractional Data Types
In C++, the classes "fract" and "shortfract" define the basic arithmetic operators for fractional data. This in turn means that the "*" operation is overloaded and invokes the proper hardware operation on fract operands.
The fract class uses the C-language type fract32 as storage for fractional 32-bit data, while shortfract uses fract16 for fractional 16-bit data.
Instances of the shortfract and fract classes can be initialized using values with the "r" suffix, provided they are within the range [-1,1). The fract class is recognized by the compiler as the internal type fract. For example:
#include <fract>

int main()
{
    fract X = 0.5r;  // "r" suffix marks a fractional constant
    return 0;
}
Instances of the shortfract class can be initialized using “r” values in the same way, but
are not represented as an internal type by the compiler. Instead, the compiler produces
a temporary fract, which is initialized using the “r” value. The value of the fract class is
then copied to the shortfract class using an implicit copy and the fract is destroyed.
The fract and shortfract classes contain routines that allow basic arithmetic operations
and movement of data to and from other data types. The example below shows the use
of the shortfract class with * and + operators.
// C++ initialization of the data with fractional constants.
#include <shortfract>
#include <stdio.h>

#define N 20

shortfract x[N] = {
    .5r, .5r, .5r, .5r, .5r,
    .5r, .5r, .5r, .5r, .5r,
    .5r, .5r, .5r, .5r, .5r,
    .5r, .5r, .5r, .5r, .5r
};

shortfract y[N] = {
    .0r, .1r, .2r, .3r, .4r,
    .5r, .6r, .7r, .8r, .9r,
    .0r, .1r, .2r, .3r, .4r,
    .5r, .6r, .7r, .8r, .9r
};

// Fractional dot product: the overloaded operators of shortfract
// invoke the proper fractional hardware operations.
shortfract fdot(int n, shortfract *x, shortfract *y)
{
    int j;
    shortfract s;

    s = 0;
    for (j = 0; j < n; j++) {
        s += x[j] * y[j];  // overloaded "*" operator
    }
    return s;
}

int main(void)
{
    fdot(N, x, y);
    return 0;
}
3.4 C vs. C++ Important Distinctions
When coding in C mode, fractional constants can be used to initialize fractional variables, bearing in mind that fract16 and fract32 are typedefs of the short int and long int built-in data types. Initialization is accomplished by normalizing the fractional number to the range determined by the Q format, as the following example shows.
Example of Q0.15 conversion from float to fract16 (recall that VisualDSP++ for the Blackfin DSP processor family supports only signed fractional data types, as presented earlier in this chapter):

fract16 x = 0.75 * 32767.0;  // fractional representation of 0.75

// Use of the built-in conversion function float_to_fr16(float)
fract16 y = float_to_fr16(19571107.945);

In the second example the number is saturated to fract16 precision, that is, to 32767. This implies that numbers to be converted must be scaled to fit the corresponding data type range (e.g., 2^15 - 1 = 32767 for 16-bit data or 2^31 - 1 = 2147483647 for 32-bit data).
In C, no special conversion is needed from a 16-bit signed integer to fract16. However, the proper functions that perform operations on fract data must be used, since in C mode there is no operator overloading. In C++, due to overloading of the built-in operators, the proper operations will be invoked as long as the data types are declared properly. To avoid potential problems, it is advisable to always use the fractional functions explicitly, even in C++ mode, when using fract data types.
References
[1] Jayant & Noll, "Digital Coding of Waveforms: Principles and Applications to Speech and Video", Chapter 3, Sampling and Reconstruction of Bandlimited Waveforms, Prentice Hall, 1984.
[2] Oppenheim, Schafer & Buck, "Discrete-Time Signal Processing", Chapter 6, Overview of Finite-Precision Numerical Effects, Prentice Hall, 1999.
[3] Ingle & Proakis, "Digital Signal Processing Using MATLAB", Chapter 9, Finite Word-Length Effects, Thomson, 2007.
[4] Udo Zölzer, "Digital Audio Signal Processing", Chapter 2, Quantization, Wiley, 1998.
[5] J. H. Conway & R. K. Guy, "The Book of Numbers", Springer-Verlag, 1996.
[6] D. Knuth, "The Art of Computer Programming", Volume 2, Seminumerical Algorithms, Third Edition, Addison-Wesley, 1997.
[7] http://en.wikipedia.org/wiki/Real_number
[8] Kondoz, "Digital Speech: Coding for Low Bit Rate Communication Systems", Wiley, 2004.
[9] Jensen, Batina, Hendriks & Heusdens, "A Study of the Distribution of Time-Domain Speech Samples and Discrete Fourier Coefficients", Proceedings of SPS-DARTS, 2005.
[10] Quatieri, "Discrete-Time Speech Signal Processing: Principles and Practice", Prentice Hall, 2002.
[11] Kuo & Gan, "Digital Signal Processors: Architectures, Implementations, and Applications", Chapter 3, Implementation Considerations, Prentice Hall, 2005.
[12] VisualDSP++ C/C++ Compiler and Library Manual for Blackfin Processors, Analog Devices, Inc., Norwood, MA, www.analog.com.