Performance Enhancement of Video Compression Algorithms with SIMD

Report Submitted By: Shamik Valia, Saket Jamkar

1 Introduction

Video compression algorithms have a large number of applications, ranging from video conferencing to video on demand to video phones. Video compression standards (such as MPEG-1, 2, 4, 7) and teleconferencing standards (such as H.2XX) are vital algorithms in these and other multimedia applications, and their performance is critical given the high data rates that are common for video. The timing constraints at such data rates can be challenging even for custom video codecs and overwhelming for some state-of-the-art superscalar processors. Performing these operations in real time is not easy on most platforms if image resolutions of acceptable quality are desired.

The algorithms, however, consist of repetitive and regular operations, which benefit greatly from architectures designed to perform such tasks efficiently. In recent years, general purpose microprocessors have also been endowed with functional units capable of Single Instruction Stream Multiple Data Stream (SIMD) operation. This project studies the speedup achievable for the most critical parts of these algorithms by using the Streaming SIMD Extensions 2 (SSE2) of the Intel Pentium 4 processor. We also discuss improvement schemes for the DCT algorithm on the SSE2 architecture.

2 Basic steps used in Video Compression

The video compression algorithm used in numerous standards (such as MPEG-1, MPEG-2 and H.263) usually consists of the following steps:

1. Motion Estimation
2. Motion Compensation and Image Subtraction
3. Discrete Cosine Transform
4. Quantization
5. Run Length Encoding
6. Entropy Coding (Huffman Coding)

We examine each of these steps in greater detail in this section.
2.1 Motion Estimation

Motion estimation is the process of calculating motion vectors by finding, for each block in the current frame, the matching block in the future frame. It exposes the temporal redundancy between frames. Various search algorithms have been devised for estimating motion. The basic assumption underlying these algorithms is that only translational motion can be compensated for. Rotational motion and zooming cannot be estimated by block-based search algorithms. Motion estimation is known to be the most crucial and computationally intensive process in the video compression algorithm.

Figure 2.1.1 Search region (-p, p) around the current block

Since most video streams have a frame rate of 15 to 30 frames per second, an object rarely moves far between two successive frames. Therefore most search algorithms look for the matching block in the next frame in the neighborhood of the position of the current block. The region in which the matching block is searched for is called the search region. The search region around a block is shown in Figure 2.1.1. The choice of p depends on the type of broadcast. For fast-moving video such as sports events, a higher value of p such as 16 or 32 may be used. For broadcasts with less motion, such as a news telecast, a smaller value of p such as 4 or 8 may be used.

Figure 2.1.2 Motion vector (u, v) and best match: the block at (x, y) is matched at (x+u, y+v)

The task of the search algorithm is to find, in the next frame, the best match for a block of the current frame. A typical block size is 8 x 8 or 16 x 16 pixels. The quality of a match is measured by the Mean Absolute Error (MAE) between the blocks. This is the average absolute pixel-wise difference between two blocks: the reference block in the current frame and the probable match found in the next frame.
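As a minimal sketch of this measure (not the code used in the project), the MAE between two blocks could be computed in C as shown below. The function name block_mae, the row-major frame layout and the stride parameter are assumptions made for illustration only.

```c
#include <stdlib.h>

/* Mean absolute error between two blocks of 8-bit pixels.
 * cur, ref : pointers to the top-left pixel of each block
 *            (assumed layout: row-major frame, `stride` bytes per row)
 * bw, bh   : block width and height in pixels */
double block_mae(const unsigned char *cur, const unsigned char *ref,
                 int stride, int bw, int bh)
{
    long sum = 0;
    for (int r = 0; r < bh; r++) {
        for (int c = 0; c < bw; c++) {
            /* one subtraction, one absolute value, one addition per pixel */
            sum += abs((int)cur[r * stride + c] - (int)ref[r * stride + c]);
        }
    }
    return (double)sum / (double)(bw * bh);
}
```

A search algorithm would call such a function once per candidate position and keep the displacement with the smallest result.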
The matching block is chosen on the basis of the magnitude of its MAE. The smaller the magnitude, the better the match. The displacement of the block with the minimum MAE is taken as the motion vector. For an M x N block at position (x, y) and a candidate displacement (i, j), the MAE is given by:

\[ \mathrm{MAE}(i,j) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} \left| C(x+k,\, y+l) - R(x+i+k,\, y+j+l) \right| \]

Next we explain the two search algorithms that were used in this project.

2.1.1 Full Search

Full search is an exhaustive search algorithm. It is the simplest method to find the motion vector for each block: MAE(i,j) is evaluated at every point in the search region. Thus, for every block of the current frame, the matching block is searched for over the complete (-p, +p) range in the future frame. For each motion vector there are (2p)^2 search locations. At each search location (i,j) we compare N x M pixels. Each pixel comparison requires three operations: a subtraction, an absolute value calculation and an addition. We ignore the cost of accessing the pixels C(x + k, y + l) and R(x + i + k, y + j + l). Thus the total complexity per block is (2p)^2 x MN x 3 operations. For a picture resolution of I x J and a picture rate of F pictures/second, the overall complexity is (IJF/MN) x (2p)^2 x MN x 3 = 3 (2p)^2 IJF operations per second.

This makes full search computationally very intensive; its CPU time is the highest of all the algorithms. At the same time its accuracy is also the highest, since the best match for every block in the current frame is always found. Full search is therefore the benchmark against which the quality of a search algorithm, which as mentioned depends on CPU time and accuracy, is compared. There is a trade-off between the efficiency of the algorithm and the quality of the predicted image, and many algorithms have been developed with this trade-off in mind.

2.1.2 Three Step Search

Three step search became very popular because of its simplicity and its robust, near-optimal performance.
It searches for the best motion vectors in a coarse-to-fine search pattern. The algorithm may be described as follows (refer to Figure 2.1.3):

Step 1: An initial step size is picked. Eight points at a distance of the step size around the centre point are picked for comparison.
Step 2: The step size is halved. The centre is moved to the point with the minimum distortion.

Steps 1 and 2 are repeated until the step size becomes smaller than 1. A particular path for the convergence of this algorithm is shown below.

Figure 2.1.3 Example path for convergence of Three Step Search (legend: points chosen for first stage, points chosen for second stage, points chosen for third stage)

2.2 Motion Compensation and Image Subtraction

The process of Motion Estimation and Motion Compensation is similar to DPCM. The idea is to reduce the bandwidth required for the video by sending only the difference frames instead of the actual frames. The motion vectors produced during Motion Estimation are used in the Motion Compensation process to produce the predicted image in the encoder, just as it would be produced in the decoder. The two images (the current frame and the motion compensated frame) are then subtracted, and the difference is sent to the receiver along with the motion vectors. The decoder can thus produce an exact copy of the future frame by first motion compensating the current frame using the motion vectors and then adding the difference image. The block diagram of the encoder is given in Figure 2.2.1 to illustrate the idea.

Fig. 2.2.1 Block Diagram of Video Encoder (Frame (n), I(x,y,t), and Frame (n+1), I(x,y,t+1), feed Motion Estimation, which produces (u,v); Motion Compensation forms E(x,y,t) = I(x,y,t) - I(x-u,y-v,t+1), which is passed to DCT coding)

2.3 Discrete Cosine Transform (DCT)

DCT based image coding is the basis for almost all the image and video compression standards.
The Discrete Cosine Transform is a derivative of the Discrete Fourier Transform (DFT), which is encountered very commonly in Digital Signal Processing. The fundamental operation performed by the DCT is to transform the space domain representation of an image to a spatial frequency domain (known as the DCT domain). The formula for the 8 x 8 DCT is given below:

\[ Y(k,l) = \frac{C(k)\,C(l)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} X_{ij} \cos\frac{(2i+1)k\pi}{16} \cos\frac{(2j+1)l\pi}{16} \]

\[ C(k) = \frac{1}{\sqrt{2}} \text{ if } k = 0, \qquad C(k) = 1 \text{ otherwise} \]

The DCT can be viewed as the process of finding, for each of 64 basis waveforms, the corresponding weight Y(k,l) such that the sum of the 64 waveforms scaled by their weights Y(k,l) yields the reconstructed version of the original 8 x 8 block.

The energy compaction of the DCT is among the highest, next only to the Karhunen-Loeve Transform. This means that the information can be compressed to a very high degree with the DCT, which is why it is commonly used. The DCT also minimizes the block artifact that is present in many other transforms, owing to its favorable periodic nature. The DCT is, in principle, a lossless process. However, due to the finite word lengths in a microprocessor, some information is lost through rounding and truncation of the calculated DCT values. This loss of information is irreversible.

2.4 Quantization

The human eye is not sensitive to the high frequency content in an image. Therefore removal of these spatial frequencies does not lead to any perceptible loss in image quality. This is the basic principle behind quantization. The spatial frequency content of the image is obtained with the DCT operation, which is followed by the removal of the high frequency content; this removal is the quantization process. The JPEG standard recommends standard quantization tables, which are used to de-emphasize higher frequencies in the DCT image. Quantization is a lossy process, and the information lost during quantization cannot be recovered.
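The effect of quantization can be illustrated with a small sketch in C. It assumes integer DCT coefficients and a hypothetical step size that grows with the frequency index (k + l). This step size, and the names step_size, div_round, quantize and dequantize, are illustrative choices, not the tables recommended by JPEG and not code from this project.

```c
#define BLK 8

/* Hypothetical step size that grows with spatial frequency (k + l).
 * Illustrative only; this is not a JPEG quantization table. */
static int step_size(int k, int l)
{
    return 8 + 4 * (k + l);
}

/* Integer division rounded to the nearest integer (divisor > 0). */
static int div_round(int value, int divisor)
{
    if (value >= 0)
        return (value + divisor / 2) / divisor;
    return -((-value + divisor / 2) / divisor);
}

/* Quantize: small high-frequency coefficients collapse to zero. */
void quantize(int coef[BLK][BLK], int level[BLK][BLK])
{
    for (int k = 0; k < BLK; k++)
        for (int l = 0; l < BLK; l++)
            level[k][l] = div_round(coef[k][l], step_size(k, l));
}

/* Dequantize: only multiples of the step size can be recovered. */
void dequantize(int level[BLK][BLK], int coef[BLK][BLK])
{
    for (int k = 0; k < BLK; k++)
        for (int l = 0; l < BLK; l++)
            coef[k][l] = level[k][l] * step_size(k, l);
}
```

With these assumptions a coefficient of 100 at (0,0) is stored as level 13 and reconstructed as 104, while a coefficient of 20 at (7,7) is stored as 0 and lost entirely, which is the irreversible loss described above.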
2.5 Run Length Encoding (RLE)

Run-length encoding is the next stage of the compression process. It encodes the runs of zeroes. If pixel values are correlated with their neighbors, there will be sequences of the same value. Instead of coding all the repeated values, we encode only the first value followed by the run length of the sequence. Intuitively, one can see how RLE achieves compression. Suppose the data is 00000…0 (ten times). Instead of writing ten zeroes one can send only 0-10, which is taken to mean that a zero occurs 10 times. Runs of zeroes are encoded in a 16 bit or 8 bit format.

Higher compression can be achieved in run-length encoding if longer strings of zeroes are obtained. This is done by performing RLE in a zigzag manner on a block. In the DCT image the higher frequency content is always found towards the lower right hand corner, while the lowest frequencies are in the upper left hand corner. During quantization the higher frequencies are reduced to zero, and therefore the values in the lower right hand side are mostly zero. By performing RLE in a zigzag manner, we obtain long runs of zeroes out of the lower right hand side of the DCT domain representation.

2.6 Huffman Encoding

Huffman encoding is a form of entropy encoding and is based on Shannon's information theory. The fundamental idea behind Huffman encoding is that symbols which occur more frequently should be represented by fewer bits, while those occurring less frequently should be represented by more bits. This scheme is similar to the one used in Morse code. Shannon proved that the entropy of the message is a lower bound on the average code length, so a code whose average length approaches the entropy is the most efficient code for sending the message.
Given n symbols S1 to Sn with probabilities of occurrence P1 to Pn in a certain message, the entropy of the message is given by

\[ \text{Entropy} = \sum_{i=1}^{n} P_i \log_2 \frac{1}{P_i} \]

Huffman encoding attempts to minimize the average number of bits per symbol and to get a value close to the entropy.

Example: We describe the algorithm for Huffman encoding with the help of an example. Consider 4 symbols with probabilities as shown in the "Probability" column of the table. The code assigned at each stage is shown in parentheses.

Symbol   Probability   Iteration 1   Iteration 2   Length in bits
S1       0.4 (1)       0.4 (1)       0.6 (0)       1
S2       0.3 (00)      0.3 (00)      0.4 (1)       2
S3       0.2 (010)     0.3 (01)                    3
S4       0.1 (011)                                 3

Table 2.6.1 Huffman encoding method

Step 1: Sort the probabilities and arrange them in descending order, as shown in the column marked "Probability".
Step 2: Add the probabilities of the last two symbols and place the sum in the next column after sorting the values.
Step 3: Continue steps 1 and 2 until only two symbols remain.
Step 4: Assign bit 0 to the upper symbol and 1 to the lower symbol (or vice versa, but the chosen convention must then be followed throughout the process).
Step 5: Trace back the probabilities according to where they came from in the previous column and append a 0 or 1 according to the convention chosen above.
Step 6: Follow this procedure back to the first column to obtain the variable length code.

Huffman encoding is not implemented in this manner in the JPEG image compression standard. The standard defines code tables for Huffman coding of both the DC values and the non-DC values. These tables are looked up for both encoding and decoding. A Huffman code is a prefix code and hence it can be uniquely decoded. Alternative methods for entropy coding or source coding are Shannon-Fano encoding and arithmetic coding. Arithmetic coding has been adopted in the JPEG 2000 standard for entropy coding after IBM agreed to release its patent on this technique for the JPEG 2000 standard.

3. Key Bottlenecks

After analyzing the video compression algorithms and surveying the literature on improving their performance, we identified two algorithms, Motion Estimation and DCT, which are the most resource intensive and in which a very high proportion of the time of a video compression algorithm is spent. Motion Estimation applies highly repetitive methods to the whole image. DCT is essentially a matrix multiplication loop which is performed on every block of the image (8 or 16 pixels on a side). As the image resolution increases the problem becomes worse, since the number of loop iterations increases and more computational resources are required. However, there is no data dependence between the various data elements used in the algorithm. It is therefore possible to improve the performance of these programs by exploiting the parallelism inherent in these media algorithms and processing different data points in parallel to obtain higher throughput. A SIMD architecture exploits this parallelism by using a wider datapath and performing the same operation on different data points (in our case pixels).

4. SIMD Architecture

Usually, processors process one data element per instruction, a processing style called Single Instruction Single Data, or SISD. In contrast, processors with SIMD capability process more than one data element per instruction. Single Instruction Stream Multiple Data Stream (SIMD) architectures perform operations on many elements in lockstep: the same instruction is performed on different data elements by different functional units. Intel's MMX/SSE/SSE2, AMD's 3DNow! and PowerPC's AltiVec ISA extensions are testimony to the benefits of adding SIMD support to traditional superscalar processors.

4.1 Intel's Streaming SIMD Extensions

SIMD extensions for the IA-32 ISA began with the Multimedia Extensions (MMX) in 1997 for the Pentium processor.
MMX provided a 64-bit datapath with subword-parallel ALUs for bytes, words and doublewords, which enhanced performance on multimedia benchmarks. However, these instructions had a very limited function, in that only integer data types could be handled. Also, since the MMX instructions used the floating-point registers, it was very hard to intermingle floating-point and MMX instructions.

Streaming SIMD Extensions (SSE), introduced with the Pentium III, added 70 new instructions to the IA-32 ISA, including extensions to MMX. The biggest winners from the new instructions were applications that handled 3D or streaming media, as applying identical instructions to multiple pieces of data was now handled in parallel. AMD was not idle over this time, and introduced 3DNow!. This much catchier-sounding set offered capabilities similar to those made possible by SSE, but was incompatible with it.

The SSE2 technology of the Pentium 4 introduced new SIMD double-precision floating-point instructions and new SIMD integer instructions into the IA-32 Intel architecture. The 128-bit SIMD integer extensions are a full superset of the 64-bit integer SIMD instructions, with additional instructions to support more integer data types, conversion between integer and floating-point data types, and efficient operations between the caches and system memory. These instructions provide a means to accelerate operations typical of 3D graphics, real-time physics, spatial (3D) audio, video encoding/decoding, encryption, and scientific applications.

4.1.1 SSE vs MMX

MMX and SSE, both of which are extensions to existing architectures, share the concept of SIMD, but they differ in the data types they handle and in the way they are supported in the processor. MMX instructions are SIMD for integers, while SSE instructions are also SIMD for single-precision floating-point numbers.
MMX instructions operate on two 32-bit integers simultaneously, while SSE instructions operate on four 32-bit floats simultaneously. A major difference between MMX and SSE is that no new registers were defined for MMX, while eight new registers were defined for SSE. Each SSE register is 128 bits long and can hold four single-precision floating-point numbers (each 32 bits long). The arrangement of the floating-point numbers in the new data type handled by SSE is illustrated in Figure 4.1.

Figure 4.1: Arrangement of numbers in the new data type.

The immediate question is: where did the registers for MMX come from? The MMX registers were allocated out of the registers of the floating-point unit. A floating-point register is 80 bits long, of which 64 bits are used for an MMX register. A limitation of this architecture is that an application cannot execute MMX instructions and perform floating-point operations simultaneously. Additionally, a large number of processor clock cycles are needed to switch from the state of executing MMX instructions to the state of executing floating-point operations and vice versa. SSE does not have this restriction, because separate registers have been defined for it. Hence, applications can execute SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously. Applications can also execute non-SIMD floating-point and SIMD floating-point instructions simultaneously. The arrangement of the registers in MMX and SSE is illustrated in Figure 4.2. Figure 4.2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure 4.2(b) illustrates the SSE registers.

Figure 4.2: Registers in MMX and SSE.

MMX and SSE have one more similarity: both have eight registers. MMX registers are named mm0 through mm7, while SSE registers are named xmm0 through xmm7. For the purpose of our experiment we make use of the SSE2 extensions.
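To make the "four 32-bit floats per instruction" point concrete, the following is a minimal sketch using the SSE intrinsics from xmmintrin.h. The function name add_floats_sse and the requirement that n be a multiple of 4 are assumptions made to keep the example short; it is not code from this project.

```c
#include <xmmintrin.h>   /* SSE: __m128 and the _mm_*_ps intrinsics */

/* Adds two float arrays four elements at a time.
 * Assumption for brevity: n is a multiple of 4. */
void add_floats_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vs = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
        _mm_storeu_ps(out + i, vs);
    }
}
```

Each loop iteration here replaces four scalar additions; the unaligned load and store variants are used so that the arrays need no special alignment.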
4.2 SSE2 Coding Techniques

There is limited compiler support available for the SIMD ISA extensions. As a result, to make use of the rich features provided by these extensions we need to use special programming techniques. One of the following techniques can be used to code programs with SSE2:

a) Assembly level programming
b) Intrinsics
c) Vector Class Library

The advantage of intrinsics and the Vector Class Library is that they free the programmer from managing registers while making the code easier to maintain and modularize. The compiler handles instruction scheduling and register allocation, and hence the executable runs faster. Each computation and data manipulation assembly instruction has a corresponding intrinsic that implements it directly. The SSE2 intrinsics carry suffixes that indicate the data type operated on:

- p, pd, ps indicate a packed, packed double-precision or packed single-precision floating-point operation
- s, sd, ss indicate a scalar, scalar double-precision or scalar single-precision floating-point operation
- i, si, su, pi, pu, epi, epu indicate an integer operation: 64-bit signed or unsigned integers, or 128-bit (ep, extended packed) signed or unsigned integers with 8, 16, 32 or 64-bit elements.

To use the intrinsics, the corresponding header file must be included: xmmintrin.h for the SSE intrinsics and emmintrin.h for the SSE2 intrinsics. We chose the intrinsics style of coding for the Motion Estimation algorithms, and we chose Intel's C++ compiler over Microsoft's Visual Studio to compile them. For most parts we used normal C code constructs. Where parallelism could be exploited with SIMD, we used SSE2 intrinsics to tell the compiler to use it.

4.3. Motion Estimation

We perform motion estimation with full search and three step search for both the 16 x 16 and 8 x 8 block sizes and compare performance. The complete C code for all the programs is provided in the appendix.
Here we present the intrinsic optimization done to incorporate the SSE2 features.

4.3.1 Code Snippets

blockdiff() is the main computationally intensive function in the program, and it can use the SSE2 features to improve its performance. We change the code using intrinsics to employ the SSE2 datapath. The snippet below shows the blockdiff function.

    int blockdiff(int x1,int x2,int y11,int y22)
    {
        unsigned char block1[16], block2[16];
        int i1,j1,k1,ch,offset1,offset2;
        int diff1[16][16], totaldiff = 0;
        FILE *fp1, *fp2;
        __m128i *b1,*b2,m1;
        union mmx {
            __m128i m;
            short int x[8];
        }m;
        …. …. ….
        // type casting pointers.
        b1 = (__m128i*)block1;
        b2 = (__m128i*)block2;
        // SAD for 16 bytes: one partial sum per 64-bit half.
        m1 = _mm_sad_epu8(*b1,*b2);
        m.m = m1;
        // The partial sums sit in the low 16 bits of each half.
        totaldiff = totaldiff + m.x[0] + m.x[4];
        }
    }

Figure 4.3 Blockdiff function to process 16 x 16 blocks.

Figure 4.3 shows the SSE2 code for the blockdiff() function, which finds the difference between two blocks located at (x1, y11) and (x2, y22). The top part shows the declarations inside the function, while the bottom part shows the calculation of the difference using the SSE2 intrinsic. We define a union called mmx, which lets us address the same 128 bits either as the "__m128i" data type (member m) or as an array of eight 16-bit integers (member x). An __m128i value holds sixteen 8-bit integer values. The block1 and block2 arrays contain sixteen 8-bit pixel values from the image. These are accessed as __m128i values through the pointers b1 and b2. Next, the _mm_sad_epu8() intrinsic computes the sum of absolute differences of these 16 values and places it in m1 as two partial sums, one for each group of 8 bytes, stored in the low 16 bits of each 64-bit half (m.x[0] and m.x[4]). totaldiff adds these to the total difference from previous iterations.

    int blockdiff(int x1,int x2,int y11,int y22)
    {
        unsigned char block1[16], block2[16];
        int i1,j1,k1,ch,offset1,offset2;
        int diff1[16][16], totaldiff = 0;
        FILE *fp1, *fp2;
        __m64 *b1,*b2,m1;
        union mmx {
            __m64 m;
            int x[2];
        }m;
        …. …. ….
        // type casting pointers.
        b1 = (__m64*)block1;
        b2 = (__m64*)block2;
        // SAD for 8 bytes.
        m1 = _m_psadbw(*b1,*b2);
        m.m = m1;
        totaldiff = totaldiff + m.x[0];
    }

Figure 4.4 Snippet from blockdiff function for 8 x 8 blocks.

Figure 4.4 shows the blockdiff function code for taking differences between 8 x 8 blocks in a similar manner. The operations performed are similar, but the data type and intrinsic are different. The data type used is __m64, which holds eight 8-bit values. The intrinsic used to calculate the sum of absolute differences is _m_psadbw(). totaldiff again contains the accumulated difference from all the iterations.

4.3.2. Results

Program                  Without SSE   With SSE
Full Search, 16 x 16     4 secs        1 sec
Full Search, 8 x 8       23 secs       6 secs
Three Step, 16 x 16      3 secs        1 sec
Three Step, 8 x 8        12 secs       3 secs

Table 1 Timing information for the various programs.

Figure 4.4.1 Frame 4 and 5 of the news.qcif video-stream
Figure 4.4.2 Motion Compensated Frame produced from Frame 4 of the news.qcif video-stream with block size of 8 x 8 and 16 x 16 respectively
Figure 4.4.3 Part of frame (4)
Figure 4.4.4 Part of frame (5)
Figure 4.4.5 Part of Predicted frame (5) with block size of 8 x 8
Figure 4.4.6 Part of Predicted frame (5) with block size of 16 x 16

We draw the following conclusions from the results given above. The speedup is by a factor of 3-4 for most programs with SSE. Also, the 8 x 8 block programs for both algorithms take longer to execute than the 16 x 16 block programs. The reason is that the loop overhead goes up, even though the number of additions and subtractions to be performed is the same. From the images we see that the 8 x 8 blocks do a better job of matching than the 16 x 16 blocks. The predicted images after motion compensation show that 8 x 8 blocks are better suited for tracking the movement of smaller image regions with these algorithms.
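As a self-contained variant of the block difference of Figure 4.3, the sketch below computes the sum of absolute differences of a complete 16 x 16 block with SSE2 and includes a scalar version for comparison. It is an illustration under stated assumptions, not the project's code: the names sad16x16_sse2 and sad16x16_scalar, the row-major layout and the stride parameter are hypothetical, and unaligned loads are used so that the pixel buffers need no 16-byte alignment.

```c
#include <emmintrin.h>   /* SSE2: __m128i, _mm_sad_epu8, ... */
#include <stdlib.h>

/* SAD of a 16 x 16 block; each row of 16 pixels is handled by one
 * _mm_sad_epu8, which yields two partial sums (one per 64-bit half). */
unsigned int sad16x16_sse2(const unsigned char *cur,
                           const unsigned char *ref, int stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; row++) {
        __m128i a = _mm_loadu_si128((const __m128i *)(cur + row * stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + row * stride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* Add the low-half and high-half partial sums. */
    unsigned int lo = (unsigned int)_mm_cvtsi128_si32(acc);
    unsigned int hi = (unsigned int)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
    return lo + hi;
}

/* Scalar reference: one subtract, absolute value and add per pixel. */
unsigned int sad16x16_scalar(const unsigned char *cur,
                             const unsigned char *ref, int stride)
{
    unsigned int sum = 0;
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            sum += (unsigned int)abs((int)cur[row * stride + col]
                                   - (int)ref[row * stride + col]);
    return sum;
}
```

The SSE2 version issues 16 SAD operations per block where the scalar version performs 256 pixel comparisons, which is the source of the kind of speedup reported in Table 1.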
4.4 DCT

Our second candidate algorithm, which is highly computationally intensive, is the DCT. It is essentially the matrix multiplication of an image block by a matrix of DCT constants. This structure can also use the SIMD architecture to improve performance. In this section we present a few suggestions that would improve the performance of the DCT algorithm. However, we do not present performance comparisons, because we lacked the capability to add additional functionality to the compiler tool. Below is a code snippet for the DCT algorithm.

    void DCT (int InBlock[][8], int OutBlock[][8])
    {
        int TempBlock[8][8], CosTrans[8][8];

        /* CosTrans = CosBlock^T (must be formed before it is used) */
        Transpose(CosBlock, CosTrans);
        /* TempBlock = InBlock * CosBlock^T */
        MatrixMult(InBlock, CosTrans, TempBlock);
        /* OutBlock = CosBlock * TempBlock */
        MatrixMult(CosBlock, TempBlock, OutBlock);
    }

Figure 4.4.1

This DCT code could be further enhanced using the SIMD support provided by Intel's SSE2. To illustrate this, let us consider a 4 x 4 matrix multiplication, first using the traditional method and then using the SIMD architecture.

Figure 4.4.2: Matrix multiplication for computation of a single element. Parts a, b, c, d show the steps for obtaining a single result of the matrix multiplication.

The traditional method requires that the row and column elements of the two matrices being multiplied be accessed one at a time, with a multiply-accumulate (MAC) operation performed for each pair. For a 4 x 4 result this requires 64 sequential operations of accessing the elements from memory and multiply-accumulating (16 result elements with 4 MACs each). With the SIMD architecture, however, only 16 operations are required, since four MACs are performed at once. The illustration is given below.

Figure 4.4.3: Matrix multiplication for four elements. Parts a, b, c, d show the computation on the SIMD platform.

Essentially, because of the wide datapath of the SSE architecture, it is possible to perform concurrently operations that are independent of each other.
Therefore, partial products for 4 different elements of the matrix are computed in parallel, improving performance at the cost of extra hardware. We believe this will improve the performance of the DCT code by up to 4 times over the original sequential code.

4.4.2 Specialized hardware support for DCT on the SIMD architecture

Using the SIMD architecture of SSE2 to implement the 8-point 1-D DCT does improve performance over a simple C implementation. A specialized accelerator for the DCT incorporated into the SIMD unit would improve performance further. The motivation of this study is therefore to examine the trade-off between hardware cost and the performance improvement obtained by using a dedicated accelerator for the 1-D DCT. The choice of the DCT accelerator was driven by its capability to scale with the SIMD architecture. Hence we implement the DCT with distributed arithmetic, which scales easily with the SIMD architecture.

4.4.2.1 DCT Implementation on hardware

The 2-D DCT has been recognized as one of the most cost effective techniques among the various transform coding schemes for image compression. The DCT is an orthogonal transform, and the N x N 2-D DCT is defined as follows:

\[ X(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N} \]

where x(i,j) (i,j = 0,1,2,…,N-1) is the pixel data, X(u,v) (u,v = 0,1,2,…,N-1) is the transformed coefficient, and C(0) = 1/\sqrt{2}, C(u) = C(v) = 1 if u,v ≠ 0.

The 2-D DCT unit is comprised of two 1-D DCT units and a transpose operation. The 2-D DCT is separated into two 1-D DCTs by the row-column decomposition technique. The input data are fed into the first DCT unit, where the 1-D DCT is calculated in row order. Then the intermediate data is transposed. Finally, the transposed data are input to the second 1-D DCT unit and processed in column order. The recursive fast DCT algorithm is used to calculate the eight-point DCT as shown:

\[ \begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} \]

\[ \begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \]

\[ A = \cos\frac{\pi}{4},\; B = \cos\frac{\pi}{8},\; C = \sin\frac{\pi}{8},\; D = \cos\frac{\pi}{16},\; E = \cos\frac{3\pi}{16},\; F = \sin\frac{3\pi}{16},\; G = \sin\frac{\pi}{16} \]

where x_i (i = 0,1,2,…,7) is the pixel data and X_u (u = 0,1,2,…,7) is the transformed coefficient. Owing to this algorithm, the number of multiplications for the DCT is halved.

Figure 4.2.2.1: DAP structure for the SSE2 extension (the inputs x0+x7, x1+x6, x2+x5, x3+x4 and x0-x7, x1-x6, x2-x5, x3-x4 feed ROM-based DAP units whose outputs are the coefficients X2, X4, X6, X1, X3, X5, X7)

Figure 4.2.2.1 shows the block diagram of the 1-D DCT processing unit on the SSE2 platform. The preprocessed sums and differences can be obtained with the parallel addition and subtraction instructions of SSE2. The multiplier-accumulator in the DCT core processor has been designed with distributed arithmetic. With distributed arithmetic, the parallel multipliers can be eliminated from the core processor and the amount of hardware is greatly reduced. Furthermore, very high speed operation can be achieved because the critical path is formed by an adder instead of a multiplier.

Here we illustrate the principle of distributed arithmetic (DA). Assume each input x_k is represented in N-bit two's complement code as follows:

\[ x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \]

The multiply-accumulate in the normal way can be written as:

\[ y = \sum_{k=1}^{K} a_k x_k = \sum_{k=1}^{K} a_k \Big( -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \Big) \]

where a_k (k = 1,2,3,…,K) is the multiply coefficient. Based on distributed arithmetic, y can be calculated as follows:

\[ y = \sum_{n=1}^{N-1} \Big( \sum_{k=1}^{K} a_k b_{kn} \Big) 2^{-n} + \sum_{k=1}^{K} a_k (-b_{k0}) \]

The multiply operation is implemented with a ROM that stores the precalculated partial products. Therefore, the hardware for multiply-accumulation based on DA consists of a ROM and an adder that accumulates the partial products read from the ROM.
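The ROM-and-adder scheme above can be sketched in C as follows. This is a software model under stated assumptions, not the hardware described in this report: the inputs are treated as NBITS-bit two's complement integers rather than fractions (the same decomposition scaled by a power of two, so the sign bit carries weight -2^(NBITS-1)), K = 4 and NBITS = 8 are arbitrary, and the names da_build_rom and da_mac are hypothetical.

```c
#define K 4        /* inputs per multiply-accumulate (assumed) */
#define NBITS 8    /* input word length in bits (assumed) */

/* ROM contents: entry `addr` holds the sum of the coefficients a[k]
 * whose bit k is set in addr, i.e. sum_k a[k] * b_k for one bit slice. */
void da_build_rom(const int a[K], int rom[1 << K])
{
    for (unsigned addr = 0; addr < (1u << K); addr++) {
        rom[addr] = 0;
        for (int k = 0; k < K; k++)
            if (addr & (1u << k))
                rom[addr] += a[k];
    }
}

/* y = sum_k a[k] * x[k], using only ROM look-ups and weighted adds.
 * Each x[k] must fit in NBITS-bit two's complement. */
int da_mac(const int rom[1 << K], const int x[K])
{
    int y = 0;
    for (int n = 0; n < NBITS; n++) {
        /* Gather bit n of every input to form the ROM address. */
        unsigned addr = 0;
        for (int k = 0; k < K; k++)
            addr |= (((unsigned)x[k] >> n) & 1u) << k;

        int partial = rom[addr] * (1 << n);   /* a shift in hardware */
        if (n == NBITS - 1)
            y -= partial;                     /* sign bit: negative weight */
        else
            y += partial;
    }
    return y;
}
```

No multiplication of a coefficient by an input value occurs in da_mac: each of the NBITS iterations performs one look-up and one addition, which mirrors why the critical path is an adder rather than a multiplier.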
In the multiply-accumulate operations based on distributed arithmetic, precalculated partial products are read out from ROMs and accumulated bit by bit from the LSBs to the MSBs. To double the processing speed, two partial products for adjacent bits can be read from individual ROMs at the same time. This method of calculation can be written as the following equation (any bit index beyond N-1 is taken as zero):

\[ y = \sum_{m=1}^{N/2} \Big( \sum_{k=1}^{K} a_k b_{k(2m-1)} \Big) 2^{-(2m-1)} + \sum_{m=1}^{N/2} \Big( \sum_{k=1}^{K} a_k b_{k(2m)} \Big) 2^{-2m} + \sum_{k=1}^{K} a_k (-b_{k0}) \]

In this case two adjacent bits are processed simultaneously. Thus, two ROMs are required to offer a pair of partial products for the higher and lower bits in every cycle, and both have two banks for the two modes of DCT. The ROM size itself was reduced by 2^4 times by using the fast algorithm in conjunction with the DA scheme. With the present configuration of the DAP (Distributed Arithmetic Processor) structure we would require 16 ROMs, each 16 x 16 bits. Each of the DAPs completes its operation in 8 cycles. Hence the entire DCT is calculated in 8 cycles, because all the DAPs work in parallel on the SIMD architecture of SSE2.

4.4.2.2 Implementation and Results

The DAP structure was implemented in Verilog, and synthesis was done using the gflx-p library. The control flow for the DAP is as follows. The DAP takes a total of 8 cycles to complete. For all the fractional binary bits we use the shift-add method: essentially, we shift the operand right by 2 and add it every cycle, as shown in the DAP figure. The shift-add complement method is used when the DAP is processing the integer part of the binary fixed-point number. This can be further understood by looking at the code in the appendix.
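The two-bits-per-cycle scheme can be modelled in software as shown below, which also makes the count of 8 cycles for 16-bit operands explicit. This is an illustrative sketch, not the Verilog of the appendix: inputs are treated as 16-bit two's complement integers (the fractional form scaled by a power of two), the ROM look-up is modelled by the hypothetical function rom_entry instead of a stored table, and K = 4 and the name da_mac_2bit are assumptions.

```c
#define K 4         /* inputs per multiply-accumulate (assumed) */
#define NBITS 16    /* word length; 2 bits per cycle gives 8 cycles */

/* Models one ROM look-up: sum of the coefficients selected by addr. */
static long rom_entry(const int a[K], unsigned addr)
{
    long sum = 0;
    for (int k = 0; k < K; k++)
        if (addr & (1u << k))
            sum += a[k];
    return sum;
}

/* Collects bit n of every input into a ROM address. */
static unsigned bit_slice(const int x[K], int n)
{
    unsigned addr = 0;
    for (int k = 0; k < K; k++)
        addr |= (((unsigned)x[k] >> n) & 1u) << k;
    return addr;
}

/* Returns sum_k a[k] * x[k]; *cycles receives the iteration count.
 * Each x[k] must fit in 16-bit two's complement. */
long da_mac_2bit(const int a[K], const int x[K], int *cycles)
{
    long y = 0;
    *cycles = 0;
    for (int n = 0; n < NBITS; n += 2) {
        long low  = rom_entry(a, bit_slice(x, n));      /* lower-bit ROM  */
        long high = rom_entry(a, bit_slice(x, n + 1));  /* higher-bit ROM */
        long weight = 1L << n;

        y += low * weight;
        if (n + 1 == NBITS - 1)
            y -= high * weight * 2;   /* sign bit: negative weight */
        else
            y += high * weight * 2;
        (*cycles)++;
    }
    return y;
}
```

Reading two bit slices per iteration halves the number of iterations relative to a one-bit-per-cycle loop, at the cost of a second ROM port or a second ROM.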
Synthesis Results

Clock period achieved: 1.8 ns

Area results:
Combinational Area: 6660.396
Non-Combinational Area: 1128.06
Interconnect Net Area: 1211.4841
Total Cell Area: 7788.476
Total Area: 8999.942

5. Conclusion

SIMD extensions to superscalar architectures have helped improve the performance of general-purpose processors on media applications. We optimized the motion estimation algorithm to use the SIMD datapath offered by modern processors, and the wide datapath gave a considerable performance improvement. We also explored improving the performance of the DCT, both with the existing ISA and with dedicated hardware for the DCT implementation. The performance analysis of this ISA extension and its hardware trade-offs remain as future work.

APPENDIX

A.1 Program for the Full Search Motion Estimation Algorithm with simple C for 16 x 16 image blocks

//-----------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//-----------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y2, i,j,k, mindiff[12][10], p = 8;
int far diff[12][10][257];
int sort[257], temp,point, l, motionx, motiony, col ,row, x, y;
long peldiff = 0;
float amad = 0.0;
time_t first, second;

void main()
{
  int i1,j1,k1;
  FILE *fpold,*fpnew;
  first = time(NULL);
  for(j=0; j<9 ;j++)
  {
    for(i=0; i<11 ;i++)
    {
      x1 = 16*i; //START OF BLOCKS IN REFERENCE IMAGE
      y11 = 16*j;
      for(row = 0; row < 2*p ; row++)
      {
        for(col = 0; col < 2*p; col++)
        {
          x2 = x1 - p + col;
          y2 = y11 - p + row;
          diff[i][j][(16*row) + col] = blockdiff(x1, x2, y11, y2);
        }
      }
//-----------------------------------------------------------------------
//SUBPROGRAM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
      for(k=0; k<256 ; k++)
      {
        sort[k] = diff[i][j][k];
        // printf("\tk%d diff%d", k, diff[i][j][k]);
      }
      for(k=0; k<256 ;k++)
      {
        for(l=0; l<256 ; l++)
        {
          if(sort[l] < sort[l+1])
          {
            temp = sort[l];
            sort[l] = sort[l+1];
            sort[l+1] = temp;
          }
        }
      }
      mindiff[i][j] = sort[255];
      // printf("\nmindiff=%d", mindiff[i][j]);
      // printf("\nsort=%d", sort[255]);
      // getch();
      for(k=0; k<255; k++)
      {
        if(diff[i][j][k] == sort[255])
        {
          l = k;
        }
      }
      x = l % 16;
      y = (l - x)/16; //ROW INDEX; USES x (AS IN A.5) BECAUSE motionx IS NOT SET YET
      motionx = x - 8;
      motiony = y - 8;
      // printf("\t %d %d %d %d", x1, y11, motionx , motiony);
      // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);
//---------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//---------------------------------------------------------------------
/*
      fpold = fopen("C:\\ECE734\\f4.raw","rb");
      fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
      if(motionx != 0 || motiony != 0)
      {
        //SKIP PIXELS UP TO INITIAL POINT
        fseek(fpold,(176 * y11) + x1, 0);
        //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
        fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);
        //COPY REQUIRED PIXELS FROM BLOCK 1
        for(i1 = 0; i1<16; i1++)
        {
          for(j1 =0; j1<16; j1++) //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
          {
            point = fgetc(fpold);
            fputc(point, fpnew);
          }
          fseek(fpold,160,1);
          fseek(fpnew,160,1);
        }
      }
      fclose(fpnew);
      fclose(fpold);
*/
      amad = amad + sqrt(motionx*motionx + motiony*motiony);
      peldiff = peldiff + mindiff[i][j];
    }
  }
  amad = amad/99;
  printf("\nAMAD = %f", amad);
  printf("\nPixel Difference %ld", peldiff/99);
  second=time(NULL);
  printf("\nDifference in time %ld", second - first);
  getch();
}
//-------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//-------------------------------------------------------------------------
int blockdiff(int x1,int x2,int y11,int y2)
{
  int block1[16][16], block2[16][16],i1,j1,k1,ch;
  int diff1[16][16], totaldiff = 0;
  FILE *fp1, *fp2;
  //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
  if(x2 < 0 || y2 < 0 || x2 >160 || y2 >128)
  {
    totaldiff = 10000;
  }
  else
  {
    fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
    fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
    if(fp1 == NULL || fp2 == NULL)
    {
      printf("File cannot be opened");
      getch();
      exit(0);
    }
    for(i1=0; i1<16 ; i1++)
    {
      for(j1=0; j1<16 ;j1++)
      {
        diff1[i1][j1] = 0;
        block1[i1][j1]= 0;
        block2[i1][j1]= 0;
      }
    }
    for(i1=0; i1<(176 * y11) + x1; i1++) //SKIP PIXELS UP TO INITIAL POINT
      ch = fgetc(fp1);
    //COPY REQUIRED PIXELS FROM BLOCK 1
    for(i1 = 0; i1<16; i1++)
    {
      for(j1 =0; j1<16; j1++)
      {
        block1[i1][j1] = fgetc(fp1);
      }
      for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
      {
        ch = fgetc(fp1);
      }
    }
    //BLOCK COPIED FROM SECOND FRAME
    for(i1=0; i1<(176*y2) + x2; i1++) //SKIP PIXELS UP TO INITIAL POINT
      ch = fgetc(fp2);
    //COPY REQUIRED PIXELS FROM BLOCK 2
    for(i1 = 0; i1<16; i1++)
    {
      for(j1 =0; j1<16; j1++)
      {
        block2[i1][j1] = fgetc(fp2);
      }
      for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
      {
        ch = fgetc(fp2);
      }
    }
    for(i1=0; i1<16 ; i1++)
    {
      for(j1=0; j1<16 ;j1++)
      {
        diff1[i1][j1] = block2[i1][j1] - block1[i1][j1];
        diff1[i1][j1] = abs(diff1[i1][j1]);
        totaldiff = totaldiff + diff1[i1][j1];
      }
    }
    fclose(fp1);
    fclose(fp2);
  }// else loop end
  return(totaldiff);
}

A.2 Full Search Program with SSE 2 intrinsics for 16 x 16 blocks

//------------------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH WITH SSE2
//------------------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<xmmintrin.h>
#include<sse2mmx.h>
//#include <iostream.h>
#include <mmsystem.h>
#include <windows.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, i,j,k, mindiff[12][10], p = 8;
int diff[12][10][257];
int sort[257], temp,point, l, motionx, motiony, col ,row, x, y;
long peldiff = 0;
float amad;
clock_t first, second;
FILE *in1,*in2;

void main()
{
  int i1,j1,k1;
  FILE *fpold,*fpnew;
  int
      diff1,diff,diffx,diffy;
  amad =0.0;
  DWORD start, finish, duration;
  //first = clock();
  start = timeGetTime();
  printf("START TIME!! %ld\n", start);
  diff = 1000000;
  diffy =0;
  diffx =0;
  //j and i move the reference blocks
  for(j=0; j<9 ;j++) //j<9 9*16 = 144
  {
    for(i=0; i<11 ;i++) //i<11 16*11 = 176
    {
      x1 = 16*i; //START OF BLOCKS IN REFERENCE IMAGE
      y11 = 16*j;
      diff = 1000000; //RESET THE BEST MATCH FOR EACH BLOCK, OTHERWISE THE MINIMUM CARRIES OVER
      for(row = 0; row<2*p; row++)
      {
        for(col = 0; col<2*p; col++)
        {
          x2 = x1 - p + col;
          y22 = y11 - p + row;
          diff1 = blockdiff(x1, x2, y11, y22);
          if (diff > diff1)
          {
            diff = diff1;
            diffx = x2;
            diffy = y22;
          }//else discard diff1
        }
      }
      motionx = diffx - x1;
      motiony = diffy - y11;
//---------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//---------------------------------------------------------------------
      fpold = fopen("C:\\ECE734\\f4.raw","rb");
      fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
      if(motionx != 0 || motiony != 0)
      {
        //SKIP PIXELS UP TO INITIAL POINT
        fseek(fpold,(176 * y11) + x1, 0);
        //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
        fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);
        //COPY REQUIRED PIXELS FROM BLOCK 1
        for(i1 = 0; i1<16; i1++)
        {
          for(j1 =0; j1<16; j1++) //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
          {
            point = fgetc(fpold);
            fputc(point, fpnew);
          }
          fseek(fpold,160,1);
          fseek(fpnew,160,1);
        }
      }
      fclose(fpnew);
      fclose(fpold);
      // amad = 99.0f ;// 1.0;((float)motionx* motionx) + ((float)motiony*motiony);
      peldiff = peldiff + diff;
    }
  }
  //amad = amad/99;
  // printf("\nAMAD = %f",amad);
  printf("\nPixel Difference %ld", peldiff/99);
  second = clock();
  printf("\nDifference in time %ld", second);
  getch();
}
//-------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//-------------------------------------------------------------------------
int blockdiff(int x1,int x2,int y11,int y22)
{
  unsigned char block1[16], block2[16];
  int i1,j1,k1,ch,offset1,offset2;
  int diff1[16][16], totaldiff = 0;
  FILE *fp1,
       *fp2;
  __m128i *b1,*b2,m1;
  union mmx
  {
    __m128i m;
    short int x[8];
  }m;
  //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
  if(x2 < 0 || y22 < 0 || x2 >160 || y22 >128)
  {
    totaldiff = 10000;
  }
  else
  {
    fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
    fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
    if(fp1 == NULL || fp2 == NULL)
    {
      printf("File cannot be opened");
      getch();
      exit(0);
    }
    //skip to the initial point and get the result.
    //THE ROW COUNTER IS THE LOCAL i1: THE GLOBAL i IS THE BLOCK INDEX USED BY main()
    for(i1 = 0 ;i1<16;i1++)
    {
      offset1 = (176*(y11+i1) + x1);
      offset2 = (176*(y22+i1) + x2);
      fseek(fp1,offset1,SEEK_SET);
      fread(block1,1,16,fp1);
      //for(i=0;i<16;i++)
      //printf("%c \n",block1[i]);
      fseek(fp2,offset2,SEEK_SET);
      fread(block2,1,16,fp2);
      // type casting pointers.
      b1 = (__m128i*)block1;
      b2 = (__m128i*)block2;
      //SAD for 16 bytes.
      m1 = _mm_sad_epu8(*b1,*b2);
      m.m = m1;
      //THE TWO PARTIAL SUMS ARE IN THE LOW 16 BITS OF EACH 64-BIT HALF (WORDS 0 AND 4)
      totaldiff = totaldiff + m.x[0] + m.x[4];
    }
    fclose(fp1);
    fclose(fp2);
  }// else loop end
  return(totaldiff);
}

A.3 Three Step Search Program with simple C for 16 x 16 image blocks

//-----------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//-----------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, diff[12][10][10],i,j,p = 16,k, mindiff[12][10];
int sort[10], temp,l, motionx, motiony;
long float amad = 0.0, dist =0.0;
long peldiff = 0;
int locx, locy, point;
time_t first, second;

void main()
{
  FILE *fpold, *fpnew;
  int i1,j1,k1,ch,c;
  first = time(NULL);
  for(j=0; j<9 ;j++)
  {
    for(i=0; i<11 ;i++)
    {
      x1 = 16*i; //START OF BLOCKS IN REFERENCE IMAGE
      y11 = 16*j;
      locx = x1;
      locy = y11;
      p = 16;
      while(p >= 1)
      {
        //ALGORITHM FOR 3 STEP SEARCH
        //FOR POINT NO. 0
        x2 = locx - p; y22 = locy - p; diff[i][j][0] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 1
        x2 = locx; y22 = locy - p; diff[i][j][1] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO.
        // 2
        x2 = locx + p; y22 = locy - p; diff[i][j][2] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 3
        x2 = locx - p; y22 = locy; diff[i][j][3] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 4
        x2 = locx; y22 = locy; diff[i][j][4] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 5
        x2 = locx + p; y22 = locy; diff[i][j][5] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 6
        x2 = locx - p; y22 = locy + p; diff[i][j][6] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 7
        x2 = locx; y22 = locy + p; diff[i][j][7] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 8
        x2 = locx + p; y22 = locy + p; diff[i][j][8] = blockdiff(x1, x2, y11, y22);
//-----------------------------------------------------------------------
//SUBPROGRAM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
        for(k=0; k<9 ; k++)
        {
          sort[k] = diff[i][j][k];
          // printf("\nk=%d diff=%d", k, diff[i][j][k]);
        }
        for(k=0; k<9 ;k++)
        {
          for(l=0; l<9 ; l++)
          {
            if(sort[l] < sort[l+1])
            {
              temp = sort[l];
              sort[l] = sort[l+1];
              sort[l+1] = temp;
            }
          }
        }
        mindiff[i][j] = sort[8];
        // printf("\nmindiff=%d", mindiff[i][j]);
        for(k=0; k<9; k++)
        {
          if(diff[i][j][k] == sort[8])
          {
            l = k;
          }
        }
        if(l==0) { locx = locx - p; locy = locy - p; }
        if(l==1) { locx = locx; locy = locy - p; }
        if(l==2) { locx = locx + p; locy = locy - p; }
        if(l==3) { locx = locx - p; locy = locy; }
        if(l==4) { locx = locx; locy = locy; }
        if(l==5) { locx = locx + p; locy = locy; }
        if(l==6) { locx = locx - p; locy = locy + p; }
        if(l==7) { locx = locx; locy = locy + p; }
        if(l==8) { locx = locx + p; locy = locy + p; }
        p = p/2;
      } //while loop end.
      motionx = locx - x1;
      motiony = locy - y11;
      // printf("\n\tx1 %d y11 %d \n \tlocx %d locy %d \n\tmotion vector x=%d y=%d",x1,y11, locx, locy, motionx , motiony);
      // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);
      // getch();
      dist = sqrt( motionx*motionx + motiony*motiony);
      amad = amad + dist;
      peldiff = peldiff + mindiff[i][j];
//*
//---------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//---------------------------------------------------------------------
      fpold = fopen("C:\\ECE734\\f4.raw","rb");
      fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
      if(motionx != 0 || motiony != 0)
      {
        //SKIP PIXELS UP TO INITIAL POINT
        fseek(fpold,(176 * y11) + x1, 0);
        //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
        fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);
        //COPY REQUIRED PIXELS FROM BLOCK 1
        for(i1 = 0; i1<16; i1++)
        {
          for(j1 =0; j1<16; j1++) //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
          {
            point = fgetc(fpold);
            fputc(point, fpnew);
          }
          fseek(fpold,160,1);
          fseek(fpnew,160,1);
        }
      }
      //CLOSE BOTH FILES EVEN WHEN THE MOTION VECTOR IS ZERO
      fclose(fpnew);
      fclose(fpold);
//*/
    }
  }
  second = time(NULL);
  printf("Time taken %ld", second - first); //TO FIND CPU TIME
  printf("\nAMAD=%lf", amad/99);
  printf("\nPixel Difference %ld", peldiff/99);
  getch();
}
//-------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//-------------------------------------------------------------------------
int blockdiff(int x1,int x2,int y11,int y22)
{
  int block1[16][16], block2[16][16],i1,j1,k1,ch;
  int diff1[16][16], totaldiff = 0;
  FILE *fp1, *fp2;
  //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
  if(x2 < 0 || y22 < 0 || x2 >160 || y22 >128)
  {
    totaldiff = 10000;
  }
  else
  {
    fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
    fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
    if(fp2 == NULL)
    {
      printf("File 2 cannot be opened");
      getch();
      exit(0);
    }
    if(fp1 == NULL)
    {
      printf("File 1 cannot be opened");
      getch();
      exit(0);
    }
    for(i1=0; i1<16 ; i1++)
    {
      for(j1=0; j1<16
          ;j1++)
      {
        diff1[i1][j1] = 0;
        block1[i1][j1]= 0;
        block2[i1][j1]= 0;
      }
    }
    for(i1=0; i1<(176 * y11) + x1; i1++) //SKIP PIXELS UP TO INITIAL POINT
      ch = fgetc(fp1);
    //COPY REQUIRED PIXELS FROM BLOCK 1
    for(i1 = 0; i1<16; i1++)
    {
      for(j1 =0; j1<16; j1++)
      {
        block1[i1][j1] = fgetc(fp1);
      }
      for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
      {
        ch = fgetc(fp1);
      }
    }
    //BLOCK COPIED FROM SECOND FRAME
    for(i1=0; i1<(176*y22) + x2; i1++) //SKIP PIXELS UP TO INITIAL POINT
      ch = fgetc(fp2);
    //COPY REQUIRED PIXELS FROM BLOCK 2
    for(i1 = 0; i1<16; i1++)
    {
      for(j1 =0; j1<16; j1++)
      {
        block2[i1][j1] = fgetc(fp2);
      }
      for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
      {
        ch = fgetc(fp2);
      }
    }
    for(i1=0; i1<16 ; i1++)
    {
      for(j1=0; j1<16 ;j1++)
      {
        diff1[i1][j1] = block2[i1][j1] - block1[i1][j1];
        diff1[i1][j1] = abs(diff1[i1][j1]);
        totaldiff = totaldiff + diff1[i1][j1];
      }
    }
    fclose(fp1);
    fclose(fp2);
  }// else loop end
  return(totaldiff);
}

A.4 Three Step Search Program with SSE 2 intrinsics for 16 x 16 blocks

//-----------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//-----------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<xmmintrin.h>
#include<sse2mmx.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, diff[12][10][10],i,j,p = 16,k, mindiff[12][10];
int sort[10], temp,l, motionx, motiony;
long float amad = 0.0, dist =0.0;
long peldiff = 0;
int locx, locy, point;
time_t first, second;

void main()
{
  FILE *fpold, *fpnew;
  int i1,j1,k1,ch,c;
  first = time(NULL);
  for(j=0; j<9 ;j++)
  {
    for(i=0; i<11 ;i++)
    {
      x1 = 16*i; //START OF BLOCKS IN REFERENCE IMAGE
      y11 = 16*j;
      locx = x1;
      locy = y11;
      p = 16;
      while(p >= 1)
      {
        //ALGORITHM FOR 3 STEP SEARCH
        //FOR POINT NO. 0
        x2 = locx - p; y22 = locy - p; diff[i][j][0] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO.
        // 1
        x2 = locx; y22 = locy - p; diff[i][j][1] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 2
        x2 = locx + p; y22 = locy - p; diff[i][j][2] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 3
        x2 = locx - p; y22 = locy; diff[i][j][3] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 4
        x2 = locx; y22 = locy; diff[i][j][4] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 5
        x2 = locx + p; y22 = locy; diff[i][j][5] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 6
        x2 = locx - p; y22 = locy + p; diff[i][j][6] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 7
        x2 = locx; y22 = locy + p; diff[i][j][7] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 8
        x2 = locx + p; y22 = locy + p; diff[i][j][8] = blockdiff(x1, x2, y11, y22);
//-----------------------------------------------------------------------
//SUBPROGRAM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
        for(k=0; k<9 ; k++)
        {
          sort[k] = diff[i][j][k];
          // printf("\nk=%d diff=%d", k, diff[i][j][k]);
        }
        for(k=0; k<9 ;k++)
        {
          for(l=0; l<9 ; l++)
          {
            if(sort[l] < sort[l+1])
            {
              temp = sort[l];
              sort[l] = sort[l+1];
              sort[l+1] = temp;
            }
          }
        }
        mindiff[i][j] = sort[8];
        // printf("\nmindiff=%d", mindiff[i][j]);
        for(k=0; k<9; k++)
        {
          if(diff[i][j][k] == sort[8])
          {
            l = k;
          }
        }
        if(l==0) { locx = locx - p; locy = locy - p; }
        if(l==1) { locx = locx; locy = locy - p; }
        if(l==2) { locx = locx + p; locy = locy - p; }
        if(l==3) { locx = locx - p; locy = locy; }
        if(l==4) { locx = locx; locy = locy; }
        if(l==5) { locx = locx + p; locy = locy; }
        if(l==6) { locx = locx - p; locy = locy + p; }
        if(l==7) { locx = locx; locy = locy + p; }
        if(l==8) { locx = locx + p; locy = locy + p; }
        p = p/2;
      } //while loop end.
      motionx = locx - x1;
      motiony = locy - y11;
      // printf("\n\tx1 %d y11 %d \n \tlocx %d locy %d \n\tmotion vector x=%d y=%d",x1,y11, locx, locy, motionx , motiony);
      // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);
      // getch();
      dist = sqrt( motionx*motionx + motiony*motiony);
      amad = amad + dist;
      peldiff = peldiff + mindiff[i][j];
//*
//---------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//---------------------------------------------------------------------
      fpold = fopen("C:\\ECE734\\f4.raw","rb");
      fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
      if(motionx != 0 || motiony != 0)
      {
        //SKIP PIXELS UP TO INITIAL POINT
        fseek(fpold,(176 * y11) + x1, 0);
        //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
        fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);
        //COPY REQUIRED PIXELS FROM BLOCK 1
        for(i1 = 0; i1<16; i1++)
        {
          for(j1 =0; j1<16; j1++) //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
          {
            point = fgetc(fpold);
            fputc(point, fpnew);
          }
          fseek(fpold,160,1);
          fseek(fpnew,160,1);
        }
      }
      //CLOSE BOTH FILES EVEN WHEN THE MOTION VECTOR IS ZERO
      fclose(fpnew);
      fclose(fpold);
//*/
    }
  }
  second = time(NULL);
  printf("Time taken %ld", second - first); //TO FIND CPU TIME
  printf("\nAMAD=%lf", amad/99);
  printf("\nPixel Difference %ld", peldiff/99);
  getch();
}
//-------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//-------------------------------------------------------------------------
int blockdiff(int x1,int x2,int y11,int y22)
{
  unsigned char block1[16], block2[16];
  int i1,j1,k1,ch,offset1,offset2;
  int diff1[16][16], totaldiff = 0;
  FILE *fp1, *fp2;
  __m128i *b1,*b2,m1;
  union mmx
  {
    __m128i m;
    short int x[8];
  }m;
  //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
  if(x2 < 0 || y22 < 0 || x2 >160 || y22 >128)
  {
    totaldiff = 10000;
  }
  else
  {
    fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
    fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
    if(fp1 == NULL || fp2 == NULL)
    {
      printf("File cannot be opened");
      getch();
      exit(0);
    }
    //skip to the initial
    // point and get the result.
    //THE ROW COUNTER IS THE LOCAL i1: THE GLOBAL i IS THE BLOCK INDEX USED BY main()
    for(i1 = 0 ;i1<16;i1++)
    {
      offset1 = (176*(y11+i1) + x1);
      offset2 = (176*(y22+i1) + x2);
      fseek(fp1,offset1,SEEK_SET);
      fread(block1,1,16,fp1);
      //for(i=0;i<16;i++)
      //printf("%c \n",block1[i]);
      fseek(fp2,offset2,SEEK_SET);
      fread(block2,1,16,fp2);
      // type casting pointers.
      b1 = (__m128i*)block1;
      b2 = (__m128i*)block2;
      //SAD for 16 bytes.
      m1 = _mm_sad_epu8(*b1,*b2);
      m.m = m1;
      //THE TWO PARTIAL SUMS ARE IN THE LOW 16 BITS OF EACH 64-BIT HALF (WORDS 0 AND 4)
      totaldiff = totaldiff + m.x[0] + m.x[4];
    }
    fclose(fp1);
    fclose(fp2);
  }// else loop end
  return(totaldiff);
}

A.5 Full Search with simple C for 8 x 8 blocks

//-----------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//-----------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int blockdiff(int, int, int, int);
int x1, x2, y11, y22, i,j,k, mindiff[22][18], p = 4;
//int diff[22][18][256];
int sort[257], temp,point, l, motionx, motiony, col ,row, x, y;
long peldiff = 0;
float amad = 0.0;
time_t first, second;
int diff1,diff,diffx,diffy;

void main()
{
  int i1,j1,k1;
  FILE *fpold,*fpnew;
  first = time(NULL);
  for(j=0; j<18 ;j++)
  {
    for(i=0; i<22 ;i++)
    {
      x1 = 8*i; //START OF BLOCKS IN REFERENCE IMAGE
      y11 = 8*j;
      diff = 1000000;
      //defines search space
      for(row = 0; row < 2*p ; row++)
      {
        //printf("\n");
        for(col = 0; col < 2*p; col++)
        {
          x2 = x1 - p + col;
          y22 = y11 - p + row;
          diff1 = blockdiff(x1, x2, y11, y22);
          if(diff > diff1)
          {
            diff = diff1;
            diffx = x2;
            diffy = y22;
          }
          //printf("diff %d ",diff[i][j][(16*row) + col]);
          //if(diff[i][j][(16*row) + col] == 0)
          //{printf(" i %d j %d row %d col %d",i,j,row,col);
          //getchar();
        }
      }
      //}
//-----------------------------------------------------------------------
//SUBPROGRAM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
/*
      for(k=0; k<256 ; k++)
      {
        sort[k] = diff[i][j][k];
        // printf("\tk%d diff%d", k, diff[i][j][k]);
      }
      for(k=0; k<256 ;k++)
      {
        for(l=0; l<256 ;
            l++)
        {
          if(sort[l] < sort[l+1])
          {
            temp = sort[l];
            sort[l] = sort[l+1];
            sort[l+1] = temp;
          }
        }
      }
      mindiff[i][j] = sort[254];
      // printf("\nmindiff=%d", mindiff[i][j]);
      // printf("\nsort=%d", sort[255]);
      // getch();
      for(k=0; k<256; k++)
      {
        if(diff[i][j][k] == sort[255])
        {
          l = k;
        }
      }
      x = l % 16;
      y = (l - x)/16;
*/
      motionx = diffx -x1;
      motiony = diffy -y11;
      mindiff[i][j] = diff; //THE SORT ABOVE IS COMMENTED OUT, SO RECORD THE BEST MATCH HERE
      // printf("\t %d %d %d %d", x1, y11, motionx , motiony);
      // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);
//---------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//---------------------------------------------------------------------
      fpold = fopen("C:\\ECE734\\f4.raw","rb");
      fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
      if(motionx != 0 || motiony != 0)
      {
        //SKIP PIXELS UP TO INITIAL POINT
        fseek(fpold,(176 * y11) + x1, 0);
        //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
        fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);
        //COPY REQUIRED PIXELS FROM BLOCK 1
        for(i1 = 0; i1<8; i1++)
        {
          for(j1 =0; j1<8; j1++) //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
          {
            point = fgetc(fpold);
            fputc(point, fpnew);
          }
          fseek(fpold,168,1);
          fseek(fpnew,168,1);
        }
      }
      fclose(fpnew);
      fclose(fpold);
      //amad = amad + sqrt(motionx*motionx + motiony*motiony);
      peldiff = peldiff + mindiff[i][j];
    }
  }
  //22*18 BLOCKS OF 8 x 8 PIXELS ARE PROCESSED, NOT THE 99 BLOCKS OF THE 16 x 16 CASE
  amad = amad/(22*18);
  printf("\nAMAD = %f", amad);
  printf("\nPixel Difference %ld", peldiff/(22*18));
  second=time(NULL);
  printf("\nDifference in time %ld", second - first);
  getch();
}
//-------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//-------------------------------------------------------------------------
int blockdiff(int x1,int x2,int y11,int y22)
{
  int block1[8][8], block2[8][8],i1,j1,k1,ch;
  int diff1[8][8];
  register int totaldiff = 0;
  FILE *fp1,*fp2;
  //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
  if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136)
  {
    totaldiff = 10000;
  }
  else
  {
    fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
    fp2 =
fopen("C:\\ECE734\\f5.raw", "rb"); if(fp1 == NULL || fp2 == NULL) { printf("File cannot be opened"); getch(); exit(0); } for(i1=0; i1<8 ; i1++) { for(j1=0; j1<8 ;j1++) { diff1[i1][j1] = 0; block1[i1][j1]= 0; block2[i1][j1]= 0; } } for(i1=0; i1<(176 * y11) + x1; i1++)//SKIP PIXELS UPTO INTIAL POINT ch = fgetc(fp1); //COPY REQUIRED PIXELS FROM BLOCK 1 for(i1 = 0; i1<8; i1++) { for(j1 =0; j1<8; j1++) { block1[i1][j1] = fgetc(fp1); } for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED { ch = fgetc(fp1); } } //BLOCK COPIED FROM SECOND FRAME for(i1=0; i1<(176*y22) + x2; i1++)//SKIP PIXELS UPTO INTIAL POINT ch = fgetc(fp2); //COPY REQUIRED PIXELS FROM BLOCK 2 for(i1 = 0; i1<8; i1++) { for(j1 =0; j1<8; j1++) { block2[i1][j1] = fgetc(fp2);} for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED {ch = fgetc(fp2);} } for(i1=0; i1<8 ; i1++) { for(j1=0; j1<8 ;j1++) { diff1[i1][j1] = block2[i1][j1] - block1[i1][j1]; diff1[i1][j1] = abs(diff1[i1][j1]); totaldiff = totaldiff + diff1[i1][j1]; } } fclose(fp1); fclose(fp2); }// else loop end return(totaldiff); } A.7 Three Step Search with simple C for 8 x 8 blocks //-----------------------------------------------------------------------// PROGRAM TO IMPLEMENT 3 STEP SEARCH //-----------------------------------------------------------------------#include<stdio.h> #include<conio.h> #include<math.h> #include<stdlib.h> #include<time.h> int blockdiff(int, int, int, int); int x1, x2, y11, y22, diff[23][19][10],i,j,p = 8,k, mindiff[23][19]; int sort[10], temp,l, motionx, motiony; long float amad = 0.0, dist =0.0; long peldiff = 0; int locx, locy, point; time_t first, second; void main() { FILE *fpold, *fpnew; int i1,j1,k1,ch,c; //clrscr(); first = time(NULL); for(j=0; j<18 ;j++) { for(i=0; i<22 ;i++) { x1 = 8*i; //START OF BLOCKS IN REFERENCE IMAGE y11 = 8*j; locx = x1; locy = y11; p = 8; while(p >= 1) { //ALGORITHM FOR 3 STEP SEARCH //FOR POINT NO. 
0 x2 = locx - p; y22 = locy - p; diff[i][j][0] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 1 x2 = locx; y22 = locy - p; diff[i][j][1] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 2 x2 = locx + p; y22 = locy - p; diff[i][j][2] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 3 x2 = locx - p; y22 = locy; diff[i][j][3] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 4 x2 = locx; y22 = locy; diff[i][j][4] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 5 x2 = locx + p; y22 = locy; diff[i][j][5] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 6 x2 = locx - p; y22 = locy + p; diff[i][j][6] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 7 x2 = locx; y22 = locy + p; diff[i][j][7] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 8 x2 = locx + p; y22 = locy + p; diff[i][j][8] = blockdiff(x1, x2, y11, y22); //-----------------------------------------------------------------------//SUBPROGARM FOR SORTING THE DIFFERENCES //----------------------------------------------------------------------for(k=0; k<9 ; k++) { sort[k] = diff[i][j][k]; // printf("\nk=%d diff=%d", k, diff[i][j][k]); } for(k=0; k<9 ;k++) { for(l=0; l<9 ; l++) { if(sort[l] < sort[l+1]) { temp = sort[l]; sort[l] = sort[l+1]; sort[l+1] = temp; } } } mindiff[i][j] = sort[8]; // printf("\nmindiff=%d", mindiff[i][j]); for(k=0; k<9; k++) { if(diff[i][j][k] == sort[8]) { l = k; } } if(l==0) { locx = locx - p; locy = locy - p; } if(l==1) { locx = locx; locy = locy - p; } if(l==2) { locx = locx + p; locy = locy - p; } if(l==3) { locx = locx - p; locy = locy; } if(l==4) { locx = locx; locy = locy; } if(l==5) { locx = locx + p; locy = locy; } if(l==6) { locx = locx - p; locy = locy + p; } if(l==7) { locx = locx; locy = locy + p; } if(l==8) { locx = locx + p; locy = locy + p; } p = p/2; } //while loop end. 
motionx = locx - x1; motiony = locy - y11; // printf("\n\tx1 %d y11 %d \n \tlocx %d locy %d \n\tmotion vector x=%d y=%d",x1,y11, locx, locy, motionx , motiony); // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]); // getch(); dist = sqrt( motionx*motionx + motiony*motiony); amad = amad + dist; peldiff = peldiff + mindiff[i][j]; ///* //---------------------------------------------------------------------// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE //---------------------------------------------------------------------fpold = fopen("C:\\ECE734\\f4.raw","rb"); fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b"); if(motionx != 0 || motiony != 0) { //SKIP PIXELS UPTO INITIAL POINT fseek(fpold,(176 * y11) + x1, 0); fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE //COPY REQUIRED PIXELS FROM BLOCK 1 for(i1 = 0; i1<8 ;i1++) { for(j1 =0; j1<8; j1++)//FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE { point = fgetc(fpold); fputc(point, fpnew); } fseek(fpold,168,1); fseek(fpnew,168,1); } } fclose(fpnew); fclose(fpold);//*/ } } second = time(NULL); printf("Time taken %ld", second - first);//TO FIND CPU TIME printf("\nAMAD=%lf", amad/(22*18)); printf("\nPixel Difference %ld", peldiff/(22*18)); getch(); } //-------------------------------------------------------------------------// FUNCTION BLOCKDIFF //-------------------------------------------------------------------------int blockdiff(int x1,int x2,int y11,int y22) { int block1[8][8], block2[8][8],i1,j1,k1,ch; int diff1[8][8], totaldiff = 0; FILE *fp1, *fp2; //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136) { totaldiff = 10000; } else { fp1 = fopen("C:\\ECE734\\f4.raw", "rb"); fp2 = fopen("C:\\ECE734\\f5.raw", "rb"); if(fp2 == NULL) { printf("File 2 cannot be opened"); getch(); exit(0); } if(fp1 == NULL) { printf("File 1 cannot be opened"); getch(); exit(0); } for(i1=0; i1<8 ; i1++) { for(j1=0; j1<8 
;j1++) { diff1[i1][j1] = 0; block1[i1][j1]= 0; block2[i1][j1]= 0; } } // for(i1=0; i1<(176 * y11) + x1; i1++)//SKIP PIXELS UPTO INTIAL POINT // ch = fgetc(fp1); fseek(fp1,(176 * y11) + x1,0); //COPY REQUIRED PIXELS FROM BLOCK 1 for(i1 = 0; i1<8; i1++) { for(j1 =0; j1<8; j1++) { block1[i1][j1] = fgetc(fp1); } for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED { ch = fgetc(fp1); } } //BLOCK COPIED FROM SECOND FRAME for(i1=0; i1<(176*y22) + x2; i1++)//SKIP PIXELS UPTO INTIAL POINT ch = fgetc(fp2); //COPY REQUIRED PIXELS FROM BLOCK 2 for(i1 = 0; i1<8; i1++) { for(j1 =0; j1<8; j1++) { block2[i1][j1] = fgetc(fp2);} for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED {ch = fgetc(fp2);} } for(i1=0; i1<8 ; i1++) { for(j1=0; j1<8 ;j1++) { diff1[i1][j1] = block2[i1][j1] - block1[i1][j1]; diff1[i1][j1] = abs(diff1[i1][j1]); totaldiff = totaldiff + diff1[i1][j1]; } } fclose(fp1); fclose(fp2); }// else loop end return(totaldiff); } A.8 Three Step Search Program with SSE 2 for 8 x 8 blocks //-----------------------------------------------------------------------// PROGRAM TO IMPLEMENT 3 STEP SEARCH //-----------------------------------------------------------------------#include<stdio.h> #include<conio.h> #include<math.h> #include<stdlib.h> #include<time.h> #include<sse2mmx.h> #include<xmmintrin.h> int blockdiff(int, int, int, int); int x1, x2, y11, y22, diff[23][19][10],i,j,p = 8,k, mindiff[23][19]; int sort[10], temp,l, motionx, motiony; long float amad = 0.0, dist =0.0; long peldiff = 0; int locx, locy, point; time_t first, second; void main() { FILE *fpold, *fpnew; int i1,j1,k1,ch,c; first = time(NULL); for(j=0; j<18 ;j++) { for(i=0; i<22 ;i++) { x1 = 8*i; //START OF BLOCKS IN REFERENCE IMAGE y11 = 8*j; locx = x1; locy = y11; p = 8; while(p >= 1) { //ALGORITHM FOR 3 STEP SEARCH //FOR POINT NO. 0 x2 = locx - p; y22 = locy - p; diff[i][j][0] = blockdiff(x1, x2, y11, y22); //FOR POINT NO. 
        // 1
        x2 = locx; y22 = locy - p; diff[i][j][1] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 2
        x2 = locx + p; y22 = locy - p; diff[i][j][2] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 3
        x2 = locx - p; y22 = locy; diff[i][j][3] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 4
        x2 = locx; y22 = locy; diff[i][j][4] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 5
        x2 = locx + p; y22 = locy; diff[i][j][5] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 6
        x2 = locx - p; y22 = locy + p; diff[i][j][6] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 7
        x2 = locx; y22 = locy + p; diff[i][j][7] = blockdiff(x1, x2, y11, y22);
        //FOR POINT NO. 8
        x2 = locx + p; y22 = locy + p; diff[i][j][8] = blockdiff(x1, x2, y11, y22);
//-----------------------------------------------------------------------
//SUBPROGRAM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
        for(k=0; k<9 ; k++)
        {
          sort[k] = diff[i][j][k];
          // printf("\nk=%d diff=%d", k, diff[i][j][k]);
        }
        for(k=0; k<9 ;k++)
        {
          for(l=0; l<9 ; l++)
          {
            if(sort[l] < sort[l+1])
            {
              temp = sort[l];
              sort[l] = sort[l+1];
              sort[l+1] = temp;
            }
          }
        }
        mindiff[i][j] = sort[8];
        // printf("\nmindiff=%d", mindiff[i][j]);
        for(k=0; k<9; k++)
        {
          if(diff[i][j][k] == sort[8])
          {
            l = k;
          }
        }
        if(l==0) { locx = locx - p; locy = locy - p; }
        if(l==1) { locx = locx; locy = locy - p; }
        if(l==2) { locx = locx + p; locy = locy - p; }
        if(l==3) { locx = locx - p; locy = locy; }
        if(l==4) { locx = locx; locy = locy; }
        if(l==5) { locx = locx + p; locy = locy; }
        if(l==6) { locx = locx - p; locy = locy + p; }
        if(l==7) { locx = locx; locy = locy + p; }
        if(l==8) { locx = locx + p; locy = locy + p; }
        p = p/2;
      } //while loop end.
        motionx = locx - x1;
        motiony = locy - y11;
        // printf("\n\tx1 %d y11 %d \n \tlocx %d locy %d \n\tmotion vector x=%d y=%d",x1,y11, locx, locy, motionx , motiony);
        // printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);
        // getch();
        dist = sqrt( motionx*motionx + motiony*motiony);
        amad = amad + dist;
        peldiff = peldiff + mindiff[i][j];
        //---------------------------------------------------------------------
        // SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
        //---------------------------------------------------------------------
        fpold = fopen("C:\\ECE734\\f4.raw","rb");
        fpnew = fopen("C:\\ECE734\\f6recon.raw","r+b");
        if(motionx != 0 || motiony != 0) {
            //SKIP PIXELS UP TO INITIAL POINT
            fseek(fpold,(176 * y11) + x1, 0);
            //SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
            fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);
            //COPY REQUIRED PIXELS FROM BLOCK 1
            for(i1 = 0; i1<8 ;i1++) {
                for(j1 =0; j1<8; j1++) //FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
                {
                    point = fgetc(fpold);
                    fputc(point, fpnew);
                }
                fseek(fpold,168,1);
                fseek(fpnew,168,1);
            }
        }
        //CLOSE BOTH FILES FOR EVERY BLOCK, EVEN WHEN THE MOTION VECTOR IS ZERO
        fclose(fpnew);
        fclose(fpold);
    }
    }
    second = time(NULL);
    printf("Time taken %ld", second - first);//TO FIND CPU TIME
    printf("\nAMAD=%lf", amad/(22*18));
    printf("\nPixel Difference %ld", peldiff/(22*18));
    getch();
}

//-------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//-------------------------------------------------------------------------
int blockdiff(int x1,int x2,int y11,int y22)
{
    unsigned char block1[16], block2[16];
    int i1,j1,k1,ch,offset1,offset2;
    int diff1[16][16], totaldiff = 0;
    FILE *fp1, *fp2;
    __m64 *b1,*b2,m1;
    union mmx {
        __m64 m;
        int x[2];
    }m;
    //DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
    if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136) {
        totaldiff = 100000;
    }
    else {
        fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
        fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
        if(fp1 == NULL || fp2 == NULL) {
            printf("File cannot be opened");
            getch();
            exit(0);
        }
        //skip to the initial point and get the result.
        for(i1 = 0 ;i1<8;i1++) {
            offset1 = (176*(y11 + i1) + x1);
            offset2 = (176*(y22 + i1) + x2);
            fseek(fp1,offset1,SEEK_SET);
            fread(block1,1,8,fp1);
            fseek(fp2,offset2,SEEK_SET);
            fread(block2,1,8,fp2);
            // type casting pointers.
            b1 = (__m64*)block1;
            b2 = (__m64*)block2;
            //SAD for the 8 bytes of one row (an __m64 holds 8 bytes).
            m1 = _m_psadbw(*b1,*b2);
            m.m = m1;
            totaldiff = totaldiff + m.x[0];
        }
        _mm_empty(); //clear the MMX state before floating-point code runs again
        fclose(fp1);
        fclose(fp2);
    }// else loop end
    return(totaldiff);
}

/*****************************************************************************
* File Name : DCTmatrix.c
*
* Comment : This file will help produce ROM values stored to compute DCT
*           using distributed arithmetic.
*
* Project : ECE734
* Author : Shamik Valia
*
****************************************************************************/
/* Arguments for the execution
* foo precision
* command line
* gcc -lm -g DCTmatrix.c -o DCTmatrix
*/
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
#include<string.h> //strlen is used in dectobinary and complement
#define FOR_HARDWARE
//#define FOR_SIMULATOR
main(int argc,char *argv[])
{
    FILE *R1,*R2,*R3,*R4,*R5,*R6,*R7,*R8 ;
    double A,B,C,D,E,F,G,v1;
    int i,j,k,l;
    //char *C;
    char c[atoi(argv[1])+1] ;
    void dectobinary(double ,int, char[]);
    void complement(char[]);
    // printf("am here please print");
    A = cos(M_PI/4);
    B = cos(M_PI/8);
    C = sin(M_PI/8);
    D = cos(M_PI/16);
    E = cos(3*M_PI/16);
    F = sin(3*M_PI/16);
    G = sin(M_PI/16);
    printf("PI = %lf \n",M_PI);
    printf("A = %lf \n",A);
    printf("B = %lf \n",B);
    printf("C = %lf \n",C);
    printf("D = %lf \n",D);
    printf("E = %lf \n",E);
    printf("F = %lf \n",F);
    printf("G = %lf \n",G);
    R1 = fopen("dct_ROM1.dat","w");
    R2 = fopen("dct_ROM2.dat","w");
    R3 = fopen("dct_ROM3.dat","w");
    R4 = fopen("dct_ROM4.dat","w");
    R5 = fopen("dct_ROM5.dat","w");
    R6 = fopen("dct_ROM6.dat","w");
    R7 = fopen("dct_ROM7.dat","w");
    R8 = fopen("dct_ROM8.dat","w");
    for(i=0;i<=1;i++){
    for(j=0;j<=1;j++){
    for(k=0;k<=1;k++) {
    for(l=0;l<=1;l++){
#ifdef FOR_SIMULATOR
        v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
        fprintf(R1,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((B*i)+(C*j)-(C*k)-(B*l)); fprintf(R2,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1); v1=0.5*((A*i)-(A*j)-(A*k)+(A*l)); fprintf(R3,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1); v1=0.5*((C*i)-(B*j)+(B*k)-(C*l)); fprintf(R4,"sequence %d%d%d%d value = %lf \n ",i,j,k,l,v1); v1=0.5*((D*i)+(E*j)+(F*k)+(G*l)); fprintf(R5,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1); v1=0.5*((E*i)-(G*j)-(D*k)-(F*l)); fprintf(R6,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1); v1=0.5*((F*i)-(D*j)+(G*k)+(E*l)); fprintf(R7,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1); v1=0.5*((G*i)-(F*j)+(E*k)-(D*l)); fprintf(R8,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1); #endif #ifdef FOR_HARDWARE v1=0.5*((A*i)+(A*j)+(A*k)+(A*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R1,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((B*i)+(C*j)-(C*k)-(B*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R2,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((A*i)-(A*j)-(A*k)+(A*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R3,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((C*i)-(B*j)+(B*k)-(C*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R4,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((D*i)+(E*j)+(F*k)+(G*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R5,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((E*i)-(G*j)-(D*k)-(F*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R6,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((F*i)-(D*j)+(G*k)+(E*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R7,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); v1=0.5*((G*i)-(F*j)+(E*k)-(D*l)); dectobinary(v1,atoi(argv[1]),c); fprintf(R8,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c); #endif } } } } fclose (R1); fclose (R2); fclose (R3); fclose (R4); fclose (R5); fclose (R6); fclose (R7); fclose (R8); } void dectobinary(double no ,int precision,char c[]) { //double no; ///int precision ; //given the nos before the decimal point it will give binary 
    // representation
    float fixed ;
    int decimal ;
    int ans = 0, x ,i=0,two_complement = 0;
    void complement(char[]);
    //char C[precision];
    //printf("i am inside");
    if (no<0){
        two_complement = 1;
        no = - no ;
    }
    decimal = (int)no ;
    fixed = no-decimal;
    while(decimal/2 !=0) {
        x = decimal % 2 ;
        ans = ans + x*pow(10,i);
        i++;
        decimal = decimal / 2 ;
    }
    x= decimal%2;
    ans = ans + x*pow(10,i);
    sprintf(c,"%d",ans);
    if(strlen(c)!=2) {
        if(strlen(c)>2)
            fprintf(stderr,"error at decimal..more than 2 places");
        c[1] = c[0];
        c[0] = '0';
        c[2]='\0';
    }
    // printf("stringlenth is %d",strlen(c));
    //given the nos after decimal point , gives the binary.
    for(i=strlen(c);i<precision;i++) {
        fixed = 2*fixed ;
        if(fixed>=1.0) {
            fixed = fixed-1.0 ;
            c[i] = '1';
        }
        else
            c[i] ='0';
    }
    c[i]='\0';
    if(two_complement==1){
        complement(c);
    }
    // return C ;
}

void complement(char c[]){
    int i = strlen(c);
    int flag = 0;
    while(i!=-1){
        if (flag ==0){
            if(c[i]=='1')
                flag = 1;
            i--;
        } //flag ==0 ends
        else
        { //flag ==1
            if(c[i]=='1')
                c[i]='0';
            else
                c[i]='1';
            i--;
        } //flag==1 ends
    } //while ends
}

/* Real dct implementation.
   x0 = x0 + x7 ; x1 = x1 + x6 ; x2 = x2 + x5 ; x3 = x3 + x4 ;
   x4 = x0 - x7 ; x5 = x1 - x6 ; x6 = x2 - x5 ; x7 = x3 - x4 ;
   (every right-hand side above uses the original inputs)
   X0 = A(x0) + A(x1) + A(x2) + A(x3);
   X2 = B(x0) + C(x1) - C(x2) - B(x3);
   X4 = A(x0) - A(x1) - A(x2) + A(x3);
   X6 = C(x0) - B(x1) + B(x2) - C(x3);
   X1 = D(x4) + E(x5) + F(x6) + G(x7);
   X3 = E(x4) - G(x5) - D(x6) - F(x7);
   X5 = F(x4) - D(x5) + G(x6) + E(x7);
   X7 = G(x4) - F(x5) + E(x6) - D(x7);
*************************************************/
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
main ()
{
    short A,B,C,D,E,F,G;
    short int X0,X1,X2,X3,X4,X5,X6,X7 ;
    short int x0,x1,x2,x3,x4,x5,x6,x7;
    double X0_,X1_,X2_,X3_,X4_,X5_,X6_,X7_;
    int Add_SS2(int ,int);
    int n1,n2,n3;
    /*
    A = cos(M_PI/4);
    B = cos(M_PI/8);
    C = sin(M_PI/8);
    D = cos(M_PI/16);
    E = cos(3*M_PI/16);
    F = sin(3*M_PI/16);
    G = sin(M_PI/16);
    */
    //the same coefficients as 16-bit fixed-point values (scaled by 2^15)
    A =23170; B=30273; C=12539; D=32138; E=27245; F=18204; G=6392;
    printf("input x0 : "); scanf("%hd",&x0);
    printf("\ninput x1 : "); scanf("%hd",&x1);
    printf("\ninput x2 : "); scanf("%hd",&x2);
    printf("\ninput x3 : "); scanf("%hd",&x3);
    printf("\ninput x4 : "); scanf("%hd",&x4);
    printf("\ninput x5 : "); scanf("%hd",&x5);
    printf("\ninput x6 : "); scanf("%hd",&x6);
    printf("\ninput x7 : "); scanf("%hd",&x7);
    /*
    X0 = (short int)(((A*x0) + (A*x1) + (A*x2) + (A*x3))>>1);
    X2 = (short int)(((B*x0) + (C*x1) - (C*x2) - (B*x3))>>1);
    X4 = (short int)(((A*x0) - (A*x1) - (A*x2) + (A*x3))>>1);
    X6 = (short int)(((C*x0) - (B*x1) + (B*x2) - (C*x3))>>1);
    X1 = (short int)(((D*x4) + (E*x5) + (F*x6) + (G*x7))>>1);
    X3 = (short int)(((E*x4) - (G*x5) - (D*x6) - (F*x7))>>1);
    X5 = (short int)(((F*x4) - (D*x5) + (G*x6) + (E*x7))>>1);
    X7 = (short int)(((G*x4) - (F*x5) + (E*x6) - (D*x7))>>1);
    */
    n1 = Add_SS2((A*x0) ,(A*x1));
    n2 = Add_SS2((A*x2) ,(A*x3));
    n3 = Add_SS2(n1,n2);
    X0 = (short int)(n3>>17);
    n1 = Add_SS2((B*x0) ,- (C*x2));
    n2 = Add_SS2((C*x1) ,- (B*x3));
    n3 = Add_SS2(n1,n2);
    X2 = (short int)(n3>>17);
    // X2=X2+ (((B*x_[0]) + (C*x_[1]) - (C*x_[2]) - (B*x_[3]))/(power(2,15-i)));
    n1 = Add_SS2((A*x0) , - (A*x1));
    n2 = Add_SS2((A*x3),- (A*x2) );
    n3 = Add_SS2(n1,n2);
    X4 = (short int)(n3>>17);
    //X4=X4+ (((A*x_[0]) - (A*x_[1]) - (A*x_[2]) + (A*x_[3]))/(power(2,15-i)));
    n1 = Add_SS2((C*x0), - (B*x1));
    n2 = Add_SS2((B*x2),- (C*x3) );
    n3 = Add_SS2(n1,n2);
    X6 = (short int)(n3>>17);
    //X6=X6+ (((C*x_[0]) - (B*x_[1]) + (B*x_[2]) - (C*x_[3]))/(power(2,15-i)));
    n1 = Add_SS2((D*x4),(E*x5));
    n2 = Add_SS2( (F*x6), (G*x7) );
    n3 = Add_SS2(n1,n2);
    X1 = (short int)(n3>>17);
    //X1=X1+ (((D*x_[4]) + (E*x_[5]) + (F*x_[6]) + (G*x_[7]))/(power(2,15-i)));
    n1 = Add_SS2((E*x4),- (G*x5));
    n2 = Add_SS2(- (D*x6), - (F*x7) );
    n3 = Add_SS2(n1,n2);
    X3 = (short int)(n3>>17);
    // X3=X3+ (((E*x_[4]) - (G*x_[5]) - (D*x_[6]) - (F*x_[7]))/(power(2,15-i)));
    n1 = Add_SS2((F*x4), - (D*x5));
    n2 = Add_SS2((G*x6), (E*x7) );
    n3 = Add_SS2(n1,n2);
    X5 = (short int)(n3>>17);
    // X5=X5+ ((F*x_[4]) - (D*x_[5]) + (G*x_[6]) + (E*x_[7]))/(power(2,15-i)));
    n1 = Add_SS2((G*x4), - (F*x5));
    n2 = Add_SS2((E*x6),- (D*x7));
    n3 = Add_SS2(n1,n2);
    X7 = (short int)(n3>>17);
    printf("X0 = %hd \n",X0);
    printf("X1 = %hd \n",X1);
    printf("X2 = %hd \n",X2);
    printf("X3 = %hd \n",X3);
    printf("X4 = %hd \n",X4);
    printf("X5 = %hd \n",X5);
    printf("X6 = %hd \n",X6);
    printf("X7 = %hd \n",X7);
    //floating-point reference values for comparison
    X0_ = 0.5*((A*x0) + (A*x1) + (A*x2) + (A*x3));
    X2_ = 0.5*((B*x0) + (C*x1) - (C*x2) - (B*x3));
    X4_ = 0.5*((A*x0) - (A*x1) - (A*x2) + (A*x3));
    X6_ = 0.5*((C*x0) - (B*x1) + (B*x2) - (C*x3));
    X1_ = 0.5*((D*x4) + (E*x5) + (F*x6) + (G*x7));
    X3_ = 0.5*((E*x4) - (G*x5) - (D*x6) - (F*x7));
    X5_ = 0.5*((F*x4) - (D*x5) + (G*x6) + (E*x7));
    X7_ = 0.5*((G*x4) - (F*x5) + (E*x6) - (D*x7));
    printf("X0 = %lf \n",X0_);
    printf("X1 = %lf \n",X1_);
    printf("X2 = %lf \n",X2_);
    printf("X3 = %lf \n",X3_);
    printf("X4 = %lf \n",X4_);
    printf("X5 = %lf \n",X5_);
    printf("X6 = %lf \n",X6_);
    printf("X7 = %lf \n",X7_);
}

//32-bit signed addition that saturates on overflow
int Add_SS2(int a, int b)
{
    int out;
    out = a + b;
    if (a > 0 && b > 0 && out < 0)
        return 0x7fffffff; //positive overflow: clamp to the largest int
    else if (a < 0 && b < 0 && out >= 0)
        return (-0x7fffffff - 1); //negative overflow: clamp to the smallest int
    else
        return out;
}

/*****************************************************************************
* File Name : ROM.c
*
* Comment : This file will help produce ROM code in verilog
*
* Project : ECE699
* Author : Shamik Valia
* Creation Date : 27 July 2003.
* Advisor : Prof Mike Schulte.
****************************************************************************/
/* Execution format will be
* ROM inputfile inputbits# outputfile outputbits#
*/
#include<stdio.h>
main(int argc,char *argv[])
{
    FILE *fo;
    //checking for the right syntax given.
    if(argc!=5) {
        fprintf(stderr,"Insufficient arguments\n");
        fprintf(stderr,"Requires 4 arguments: inputfile inputbits# outputfile outputbits#\n");
        return 1; //stop here instead of using missing arguments
    }
    //open file to read
    // fi = fopen(argv[1],"r");
    //open a new file to write output
    fo = fopen(argv[3],"w");
    //Copyright information...
    fprintf(fo,"//#########################################################\n");
    fprintf(fo,"//\n");
    fprintf(fo,"// File Name:%s \n",argv[3]);
    fprintf(fo,"//\n");
    fprintf(fo,"// Comment : The file is ROM code for inbit#%s and output#%s \n",argv[2],argv[4]);
    fprintf(fo,"// Project: \n");
    fprintf(fo,"// Author : Shamik Valia \n");
    fprintf(fo,"// Creation Date : \n");
    fprintf(fo,"// Advisor:Prof.Mike Schulte\n");
    fprintf(fo,"//\n");
    fprintf(fo,"//#########################################################\n");
    fprintf(fo,"\n\n");
    //Generation of verilog code.
fprintf(fo,"module rom(\n "); fprintf(fo," select, //input select line to ROM \n"); fprintf(fo," output1, //output data of the ROM \n"); fprintf(fo," output2, //output data of the ROM \n"); fprintf(fo,"); \n\n"); fprintf(fo,"parameter in_bits = %s, \n",argv[2]); fprintf(fo," out_bits = %s; \n\n",argv[4]); fprintf(fo,"input [in_bits-1:0] select ; \n"); fprintf(fo,"output1[out_bits-1:0] output ; \n\n"); fprintf(fo,"output2[out_bits-1:0] output ; \n\n"); fprintf(fo,"reg [out_bits-1:0] output ;\n\n"); fprintf(fo,"always@(select) \n"); fprintf(fo," case(select) \n"); fprintf(fo," `include \"%s\" \n ",argv[1]); fprintf(fo," endcase \n"); fprintf(fo,"endmodule \n"); } Distributed Arithmetic structure // module struc_16dap(in1,in2,in3,in4,reset,clk,start,out,done); input [15:0] in1,in2,in3,in4 ; input reset,clk,start; output [20:0] out ; output done ; wire [3:0] in1_,in2_,in3_ ,in4_,in5_,in6_,in7_,in8_,in9_,in10_,in11_,in12_,in13_,in14_,in15_,in16_; wire [3:0] in_ROM1,in_ROM2; wire [15:0] out_ROM1,out_ROM2; wire [18:0] ext_out1,ext_out2; wire [20:0] adder_out ; wire reset_reg,shiftAdd,shiftaddcomplement,rd,reset_reg_use; wire [2:0] S1,S2; wire strobe; //instantiation of a DAP controller dap_controller_4 D1(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk); //wire the bits to be put in mux. 
assign in16_ = {in1[15],in2[15],in3[15],in4[15]}; assign in15_ = {in1[14],in2[14],in3[14],in4[14]}; assign in14_ = {in1[13],in2[13],in3[13],in4[13]}; assign in13_ = {in1[12],in2[12],in3[12],in4[12]}; assign in12_ = {in1[11],in2[11],in3[11],in4[11]}; assign in11_ = {in1[10],in2[10],in3[10],in4[10]}; assign in10_ = {in1[9],in2[9],in3[9],in4[9]}; assign in9_ = {in1[8],in2[8],in3[8],in4[8]}; assign in8_ = {in1[7],in2[7],in3[7],in4[7]}; assign in7_ = {in1[6],in2[6],in3[6],in4[6]}; assign in6_ = {in1[5],in2[5],in3[5],in4[5]}; assign in5_ = {in1[4],in2[4],in3[4],in4[4]}; assign in4_ = {in1[3],in2[3],in3[3],in4[3]}; assign in3_ = {in1[2],in2[2],in3[2],in4[2]}; assign in2_ = {in1[1],in2[1],in3[1],in4[1]}; assign in1_ = {in1[0],in2[0],in3[0],in4[0]}; //select the input to be selected for ROM input mux_8 #(4) M1(in_ROM1,in2_,in4_,in6_,in8_,in10_,in12_,in14_,in16_,S1); mux_8 #(4) M2(in_ROM2,in1_,in3_,in5_,in7_,in9_,in11_,in13_,in15_,S2); //ROM data strcuture. ROM_16 RO1(in_ROM1,out_ROM1,rd); ROM_16 RO2(in_ROM2,out_ROM2,rd); //Sign extension for ROM values assign ext_out1 = {out_ROM1[15],out_ROM1[15],out_ROM1[15],out_ROM1};//higher value assign ext_out2 = {out_ROM2[15],out_ROM2[15],out_ROM2[15],out_ROM2};//lesser value //adder block adder Add1(adder_out,ext_out1,ext_out2,out,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use); assign strobe = 1'b1; reg_ #(21) RE1(out,adder_out,clk,strobe,reset_reg); endmodule module adder(adder_out,in_adder_,in_adder,r,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use); input [18:0] in_adder_,in_adder ; input [20:0] r; input shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use; output [20:0]adder_out; reg [20:0] adder_out; reg [20:0] c,d,e; always@(in_adder,in_adder_,r,shiftAdd,shiftaddcomplement,reset_reg_use) begin if (shiftAdd ==1'b1) begin if (reset_reg_use == 1'b1) begin c = {in_adder_,1'b0,1'b0}; d = {in_adder[18],in_adder,1'b0}; adder_out = {in_adder_,1'b0,1'b0}+{in_adder[18],in_adder,1'b0}; end else begin c = {in_adder_,1'b0,1'b0}; d 
= {in_adder[18],in_adder,1'b0};
            e = {r[20],r[20],r[20:2]} ;
            adder_out = {in_adder_,1'b0,1'b0}+({in_adder[18],in_adder,1'b0})+({r[20],r[20],r[20:2]});
        end
    end
    else begin
        if (shiftaddcomplement ==1'b1) begin
            adder_out = {(~in_adder_ + 1'b1),1'b0,1'b0} +({in_adder[18],in_adder,1'b0}) + ({r[20],r[20],r[20:2]});
            c = {(~in_adder_ + 1'b1),1'b0,1'b0};
            d = {in_adder[18],in_adder,1'b0};
            e = {r[20],r[20],r[20:2]} ;
        end
        else begin
            c= 20'b0;
            d = 20'b0;
            e = 20'b0;
            adder_out = 20'b0;
        end
    end
end
endmodule

module dap_controller_4(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);
input reset ,clk ,start;
output reset_reg,shiftAdd,shiftaddcomplement,rd,done,reset_reg_use;
output [2:0] S1,S2;
reg reset_reg,shiftAdd,shiftaddcomplement,rd,reset_reg_use;
reg [2:0] S1,S2;
reg [2:0]state,nextstate ;
reg done;
parameter ST0 = 3'b000;
parameter ST1 = 3'b001;
parameter ST2 = 3'b010;
parameter ST3 = 3'b011;
parameter ST4 = 3'b100;
parameter ST5 = 3'b101;
parameter ST6 = 3'b110;
parameter ST7 = 3'b111;
always@(posedge clk or posedge reset)
begin
    // reset_reg = 1'b0;
    // shiftAdd = 1'b0;
    // shiftaddcomplement =1'b0;
    // rd = 1'b1;
    if (reset == 1'b1) begin
        state = ST0;
        //reset_reg = 1'b1;
        rd = 1'b0;
        reset_reg_use = 1'b1;
    end
    else begin
        case(state)
        ST0: if(start ==1'b1) begin
                S1 = 3'b000; S2 = 3'b000;
                nextstate = ST1 ;
                reset_reg_use = 1'b1;
                shiftAdd = 1'b1;
                rd = 1'b1;
                done = 1'b0;
            end
            else
                nextstate = ST0;
        ST1: begin
                nextstate = ST2;
                S1 = 3'b001; S2 = 3'b001;
                shiftAdd = 1'b1;
                reset_reg_use =1'b0;
                rd = 1'b1;
            end
        ST2 : begin
                nextstate = ST3;
                S1 = 3'b010; S2 = 3'b010;
                shiftAdd = 1'b1;
                reset_reg_use =1'b0;
                rd = 1'b1;
            end
        ST3 : begin
                nextstate = ST4;
                S1 = 3'b011; S2 = 3'b011;
                shiftAdd = 1'b1;
                reset_reg_use =1'b0;
                rd = 1'b1;
            end
        ST4 : begin
                nextstate = ST5;
                S1 = 3'b100; S2 = 3'b100;
                shiftAdd = 1'b1;
                reset_reg_use =1'b0;
                rd = 1'b1;
            end
        ST5 : begin
                nextstate = ST6;
                S1 = 3'b101; S2 = 3'b101;
                shiftAdd = 1'b1;
                reset_reg_use =1'b0;
                rd = 1'b1;
            end
        ST6 : begin
                nextstate = ST7;
                S1 = 3'b110; S2 = 3'b110;
                shiftAdd = 1'b1;
                reset_reg_use =1'b0;
                rd = 1'b1;
            end
        ST7 : begin
                nextstate = ST0;
                S1 = 3'b111; S2 = 3'b111;
                //shiftAdd = 1'b1;
                shiftaddcomplement = 1'b1;
                rd = 1'b1;
                done =1'b1;
            end
        default: nextstate = ST0;
        endcase
        state = nextstate ;
    end
end
endmodule

/*
module struc_16;
wire [15:0] in1,in2,in3,in4;
wire [20:0] out ;
reg clk , reset ,start ;
reg [63:0] t;
wire [31:0] out1;
wire [20:0] value;
wire [15:0] error;
wire done;
reg val;
assign in1=t[15:0];
assign in2=t[31:16];
assign in3=t[47:32];
assign in4=t[63:48];
check16 c1(in1,in2,in3,in4,out1);
struc_16dap D1(in1,in2,in3,in4,reset,clk,start,out,done);
assign value = out<<1;
//assign error = (val==1'b1) ? (out1[30:15] - out[16:1]) : 16'b0;
assign error = out1[30:15] - out[16:1];
always #5 clk = ~clk ;
always
begin
    # 5 start = 1'b1;val = 1'b0; t = t+1;
    # 10 start =1'b0;
    # 95 val=1'b1;
end
initial
begin
    clk = 1'b1;reset =1'b1;t = 64'h0000_0000_0000_0000;
    #7 reset =1'b0;
end
endmodule
*/
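
Supplementary sketch: block difference on frames held in memory

The blockdiff routine of listing A.8 reopens both frame files and seeks to each row on every call, so file access is likely to dominate its running time. The sketch below shows one way the same 8 x 8 sum of absolute differences could be computed once both frames are already in memory. It is not the code that was measured for this report. The names sad8x8 and sad8x8_scalar, the in-memory frame layout, and the use of the 128-bit intrinsic _mm_sad_epu8 in place of the 64-bit _m_psadbw are all assumptions made for illustration. A plain C fallback is used when SSE2 is not available at compile time.

```c
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

enum { FRAME_W = 176, BLOCK = 8 };   /* frame width and block size used in the listings */

/* Scalar reference: sum of absolute differences over an 8 x 8 block.
   cur and ref point at the top-left pixel of each block; stride is the
   number of bytes between consecutive rows (the frame width). */
static int sad8x8_scalar(const unsigned char *cur, const unsigned char *ref, int stride)
{
    int sum = 0;
    for (int r = 0; r < BLOCK; r++)
        for (int c = 0; c < BLOCK; c++)
            sum += abs((int)cur[r * stride + c] - (int)ref[r * stride + c]);
    return sum;
}

/* Same result, one SAD instruction per row when SSE2 is available. */
static int sad8x8(const unsigned char *cur, const unsigned char *ref, int stride)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int r = 0; r < BLOCK; r++) {
        /* Load the 8 pixels of one row into the low half; the high half is zero. */
        __m128i a = _mm_loadl_epi64((const __m128i *)(cur + r * stride));
        __m128i b = _mm_loadl_epi64((const __m128i *)(ref + r * stride));
        /* The low 64-bit lane receives the sum of the 8 absolute differences. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(a, b));
    }
    return _mm_cvtsi128_si32(acc);   /* at most 64 * 255, so 32 bits suffice */
#else
    return sad8x8_scalar(cur, ref, stride);
#endif
}
```

With both frames read once into arrays of 176 * 144 bytes, a candidate at (x2, y22) for the block at (x1, y11) would be scored as sad8x8(frame4 + 176*y11 + x1, frame5 + 176*y22 + x2, FRAME_W), where frame4 and frame5 are hypothetical buffers holding f4.raw and f5.raw. Because the 128-bit registers are separate from the MMX registers, this variant needs no _mm_empty call before floating-point code.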