Performance Tuning Project
Technion Softlab

Submitted by: Eyal Segal, Koren Shoval
Supervisors: Liat Atsmon, Koby Gottlieb

Spring 2009

Table of Contents

1. Introduction
   1.1 Introduction
   1.2 About WavPack
   1.3 Project Goals
2. WavPack
   2.1 Wave File Format
   2.2 WavPack File Format
      2.2.1 Description
      2.2.2 Block Header
      2.2.3 Metadata Sub-Blocks
      2.2.4 Metadata Tags
   2.3 Algorithm and Program Flow
3. Benchmark
   3.1 Testing Environment
      3.1.1 Hardware
      3.1.2 Software
   3.2 Testing Case in WavPack
   3.3 Running WavPack with VTune
   3.4 Original Version Performance
   3.5 Conclusions and Objectives
4. First Optimization – Parallel I/O/CPU
   4.1 Description
   4.2 Performance Testing
   4.3 Conclusions
5. Second Optimization – Multi-Threaded Processing
   5.1 Description
   5.2 Performance Testing
   5.3 Conclusions
6. Third Optimization – Moving to SIMD
   6.1 Description
   6.2 Performance Testing
   6.3 Conclusions
7. Fourth Optimization – Implementation Improvements
   7.1 Description
   7.2 Performance Testing
   7.3 Conclusions
8. Optimization Summary
9. Appendix A – Blocking Queue
10. Appendix B – Thread Pool
11. Appendix C – SIMD
1. Introduction

1.1 Introduction

Many open source applications are single-threaded and not yet optimized for modern multi-core processors. Performance Tuning projects attempt to improve such applications by introducing multithreading and by rewriting code to use new SIMD instructions. This project deals with the performance tuning of WavPack, an open source, lossless encoder that converts WAV audio files to WV files. As required, WavPack is single-threaded, written in C, and incorporates some MMX instructions from several years ago.

1.2 About WavPack

WavPack is a completely open audio compression format providing lossless, high-quality lossy, and a unique hybrid compression mode. WavPack compresses WAV files into WV files. In the default lossless mode, WavPack acts just like a WinZip compressor for audio files.
However, unlike MP3 or WMA encoding, which can affect the sound quality, not a single bit of the original information is lost, so there is no chance of degradation. This makes lossless mode ideal for archiving audio material or any other situation where quality is paramount. The compression ratio depends on the source material, but it is generally between 30% and 70%.

The hybrid mode provides all the advantages of lossless compression with an additional bonus. Instead of creating a single file, this mode creates both a relatively small, high-quality lossy file that can be used all by itself, and a "correction" file that (when combined with the lossy file) provides full lossless restoration.

WavPack is supported by some well-known Windows software and is also making its way into Linux and Mac territory. For example, there are WavPack plug-ins for Winamp, Nero, and more. To play WavPack files in Windows Media Player, for example, installing the ffdshow filter is enough. In addition to software support, WavPack is also supported by hardware devices such as mobile phones, portable music players, and more. For example, all new Nokia mobile phones that run the Symbian S60 3rd edition OS can play WavPack files. Among portable players, devices like the Cowon A3 PMP, iRiver players, the iPod and many more support WavPack files.

1.3 Project Goals

The main goal of this project is to enhance the performance of the WavPack application and achieve a good speedup compared to the original application; the output, of course, must stay identical. Specifically:

1. Work with the Intel® VTune™ Performance Analyzer to find potential spots for performance enhancement.
2. Learn and use the new instructions of Intel's new Core i7 processor.
3. Apply multithreading techniques to achieve high performance.
4. After the project is completed, return the improved WavPack application to the open source community.

2. WavPack

2.1 Wave File Format

WAV (or WAVE), short for Waveform audio format, also known as Audio for Windows, is a Microsoft and IBM audio file format standard for storing an audio bitstream on PCs. It is an application of the RIFF bitstream format method for storing data in "chunks". Though a WAV file can hold compressed audio, the most common WAV format contains uncompressed audio in the linear pulse code modulation (LPCM) format.

A RIFF file starts with a file header followed by a sequence of data chunks; the general RIFF structure describes a list of chunks. A chunk is either a list or data. A list node contains an ID, the size of the following sub-chunks, and its type. In the WAV format shown in figure 2.1, the RIFF header is a list chunk, the second chunk is a data chunk specifying the data format, and the third is a data chunk containing the actual sample data. In the WAV format, every chunk size is even, which means one byte of padding is added when the length is odd. There may be additional sub-chunks in a WAV file, and even more than one wave data chunk.

Figure 2.1: wave file format.

RIFF chunk – header of the WAVE format:

ChunkID – contains the letters "RIFF" in ASCII form (0x52494646, big-endian).
ChunkSize – the size of the entire file in bytes, minus 8 bytes for the two fields not included in this count (ChunkID and ChunkSize); i.e., 36 + Subchunk2Size.
Format – contains the letters "WAVE" (0x57415645, big-endian).

Format sub-chunk – describes the sound data's format:

Subchunk1ID – contains the letters "fmt " (0x666d7420, big-endian).
Subchunk1Size – the size of the rest of this sub-chunk following this field; 16 for PCM.
AudioFormat – PCM = 1 (i.e., linear quantization); values other than 1 indicate some form of compression.
NumChannels – mono = 1, stereo = 2, etc.
SampleRate – 8000 Hz, 44100 Hz, etc.
ByteRate – average byte rate (generally SampleRate * NumChannels * BitsPerSample / 8).
BlockAlign – NumChannels * BitsPerSample / 8.
BitsPerSample – 8 bits, 16 bits, etc.

Data sub-chunk – contains the size of the data and the actual sound:

Subchunk2ID – contains the letters "data" (0x64617461, big-endian).
Subchunk2Size – the number of bytes in the data block that follows this field (generally NumSamples * NumChannels * BitsPerSample / 8).
Data – the actual sound data.
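For reference, the canonical PCM layout above maps onto a C structure along these lines. This is an illustrative sketch only, not code from the WavPack sources: real files may carry extra sub-chunks between "fmt " and "data", and the structure must be read unpadded and little-endian.

#include <stdint.h>

#pragma pack(push, 1)          /* the on-disk layout is unpadded */
typedef struct {
    char     ChunkID[4];       /* "RIFF" */
    uint32_t ChunkSize;        /* file size minus 8 */
    char     Format[4];        /* "WAVE" */

    char     Subchunk1ID[4];   /* "fmt " */
    uint32_t Subchunk1Size;    /* 16 for PCM */
    uint16_t AudioFormat;      /* 1 = PCM */
    uint16_t NumChannels;      /* 1 = mono, 2 = stereo */
    uint32_t SampleRate;       /* e.g. 44100 */
    uint32_t ByteRate;         /* SampleRate * NumChannels * BitsPerSample / 8 */
    uint16_t BlockAlign;       /* NumChannels * BitsPerSample / 8 */
    uint16_t BitsPerSample;    /* e.g. 16 */

    char     Subchunk2ID[4];   /* "data" */
    uint32_t Subchunk2Size;    /* NumSamples * NumChannels * BitsPerSample / 8 */
    /* Subchunk2Size bytes of sample data follow */
} WavHeader;
#pragma pack(pop)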
2.2 WavPack File Format

2.2.1 Description

A WavPack file consists of a series of WavPack audio blocks. Every block contains "metadata" – information about the sound data, including sampling rate, channels, bits per sample, and more. The metadata may also contain various coefficients used for restoring samples, the correction bitstream, and the actual compressed samples. WavPack blocks are very easy to identify by their unique header, which contains, among other things, information about the total size of the block and the audio format stored. The blocks are completely independent in that each can be decoded all by itself. They may contain up to 131,072 samples, either stereo or mono, and can be lossless or lossy.

An additional format is the correction file (.wvc), which has a structure identical to the main file. There is a one-to-one correspondence between main-file blocks that contain audio and their correction-file matches; the only difference is in the headers of the blocks – the block size and the CRC value. To allow reduced memory requirements (mostly for hardware devices), it is possible to decode regular WavPack files without buffering an entire block.

2.2.2 Block Header

Here is the 32-byte little-endian header at the front of every WavPack block:

ckID (4 bytes) – block ID ('wvpk').
ckSize (32 bits) – total block size (not including this field or 'wvpk').
version (16 bits) – currently valid versions are 0x402–0x410.
track_no (8 bits) – track number (not currently implemented).
index_no (8 bits) – track sub-index (not currently implemented).
total_samples (32 bits) – total samples in the entire file (valid if block_index = 0; a value of -1 indicates unknown length).
block_index (32 bits) – index of the block's first sample relative to the beginning of the file.
block_samples (32 bits) – samples in this block (0 means no audio present).
flags (32 bits) – various flags for ID and decoding.
crc (32 bits) – CRC of the actual decoded data.
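In C, this header maps onto a 32-byte structure along these lines (a sketch that mirrors the table above; not necessarily the exact declaration used in the WavPack sources):

#include <stdint.h>

/* 32-byte little-endian header at the front of every WavPack block;
   field names and widths follow the table above. */
typedef struct {
    char     ckID[4];        /* "wvpk" */
    uint32_t ckSize;         /* block size, excluding ckID and ckSize */
    uint16_t version;        /* 0x402 - 0x410 */
    uint8_t  track_no;       /* not currently implemented */
    uint8_t  index_no;       /* not currently implemented */
    uint32_t total_samples;  /* valid when block_index == 0; -1 = unknown */
    uint32_t block_index;    /* index of the block's first sample */
    uint32_t block_samples;  /* 0 = no audio in this block */
    uint32_t flags;          /* id/decoding flags */
    uint32_t crc;            /* CRC of the decoded data */
} WavpackHeader;             /* sizeof == 32 */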
2.2.3 Metadata Sub-Blocks

Following the 32-byte header, and running to the end of the block, is a series of "metadata" sub-blocks. These sub-blocks contain extra information needed to decode the audio, but may also contain user information that is not required for decoding. The final sub-block is usually the compressed audio bitstream itself.

The format of the metadata is:

id (8 bits) – four masks are available: 0x1f – metadata function; 0x20 – the decoder doesn't need to understand this metadata; 0x40 – the actual data length is one byte less than indicated (odd-sized data); 0x80 – large block (> 255 words).
word_size (8 bits) / word_size[3] (24 bits) – data size in words (the 8-bit count for small blocks, the 24-bit count for large blocks).
data[word_size] (16-bit words) – the data itself, padded to an even number of bytes.

The more relevant metadata IDs are:

ID_DUMMY – can be used to pad WavPack blocks
ID_WV_BITSTREAM – normal compressed audio bitstream (wv file)
ID_WVC_BITSTREAM – correction file bitstream (wvc file)
ID_WVX_BITSTREAM – special extended bitstream for floating-point data or long integers (> 24 bits)
ID_RIFF_HEADER – RIFF header for .wav files (before the audio)
ID_RIFF_TRAILER – RIFF trailer for .wav files (after the audio)

2.2.4 Metadata Tags

These are special tags for storing user data such as artist, title, album, track, year, etc.

2.3 Algorithm and Program Flow

WavPack is an open audio compression algorithm and an open source software implementation that supports three compression modes – lossless, lossy, and a unique hybrid compression mode. The project's focus is on the lossless stereo mode, in which the audio samples are compressed at their full resolution and no information is discarded along the way. The basic algorithm has three main parts:

1. Joint stereo processing, which removes inter-channel correlations.
2. Multipass decorrelation, which removes intra-channel correlations between neighboring audio samples.
3. An entropy encoder used to compress the data.

The input stream is partitioned into blocks that can be either mono or stereo and are about 0.5 seconds long. For each of these blocks, the first step is to convert the left and right channels into difference and average (also referred to as side and mid). The second step is prediction: multiple passes are made over each block, using a set of filters and an adaptive LMS algorithm. WavPack allows between 2 and 16 passes (the default is five). After each filtered sample, the weight is updated for the next sample based on the signs of the filter input and the residual. In lossless mode, the results of the decorrelation (the residuals) of all the passes are passed to the entropy coder for exact translation. The entropy coder uses variations on the standard Elias and Golomb algorithms to produce the compressed lossless output.

The implementation closely follows this description. It goes through the samples performing the first step in large blocks (i.e., computing side/mid and handling different sample sizes). It then iterates over the 0.5-second blocks of 24,000 samples (half a second at 48,000 samples per second). It performs the second step (the multiple passes) and finally compresses the resulting blocks. The weights are passed down to the functions via the global context, which contains the bit stream as well as the additional information for each pass.

Figure 2.2: a diagram describing the flow of the original implementation.
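To make the adaptation step concrete, the core of one decorrelation step looks roughly like the following sketch. This is a simplification of the real pack code, not a verbatim excerpt: the 10-bit weight scaling and the sign-sign update convey the idea described above.

#include <stdint.h>

/* One decorrelation step, schematically. 'prediction' is the filter input
   (a combination of earlier samples), scaled by a weight where 1024 ~ 1.0.
   The weight is then nudged by +/- delta according to whether the filter
   input and the residual agree in sign (sign-sign LMS adaptation). */
static int32_t decorr_step(int32_t sample, int32_t prediction,
                           int32_t *weight, int32_t delta)
{
    int32_t residual =
        sample - (int32_t)(((int64_t)*weight * prediction + 512) >> 10);

    if (prediction && residual)
        *weight += ((((prediction ^ residual) >> 30) | 1) * delta);

    return residual;   /* residuals go on to the entropy coder */
}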
3. Benchmark

3.1 Testing Environment

3.1.1 Hardware

The project was developed and tested on a Core i7 at 2.66 GHz and a Core 2 Quad Q6600 at 2.4 GHz, both with 4 GB of RAM.

3.1.2 Software

Windows XP/Vista with Visual Studio 2008 for development and debugging, and the Intel VTune toolkit for performance testing, memory-leak checks, and thread checks. The project is compiled with the Microsoft compiler.

3.2 Testing Case in WavPack

The WavPack program supports many different decoding and encoding options. We decided to improve one mode – the lossless stereo mode. To test this mode, we used a 330 MB WAV file with two channels (stereo). Note: WavPack can process a 330 MB WAV file in about 30 seconds.

3.3 Running WavPack with VTune

VTune is Intel software designed to measure CPU behavior while running a specified process. It can integrate with a Visual Studio solution and display hot spots on methods in the code, according to the CPU load. For our purposes, we mostly used the CPI, cache-miss, and branch-misprediction measures.

3.4 Original Version Performance

As noted above, processing a 330 MB file takes about 30 seconds. In figure 3.1 you can see the results of the VTune analysis of the original program – notice that three major functions consume most of the CPU time:

Figure 3.1: VTune hotspots analysis results.

'decorr_stereo_pass_id2' – a mathematical function for processing the WAV blocks into WV blocks.
'send_words_lossless' – a significant part of the block processing; mostly contains logic that prepares for file writing.
'flush_word' – a function for writing data (bits) to the file.

Figure 3.2: VTune analysis call graph.

In addition, the functions that consume the majority of the CPU time are visible in figure 3.2 (marked in red). Notice that the three functions mentioned above are called for each block ('flush_word' is mostly called from within 'send_words_lossless').

3.5 Conclusions and Objectives

At first glance, the WavPack encoding algorithm is made up of several steps that split the data into many elements and process each of them separately, but unfortunately each "small" element of data depends on the previous one via the WavPack context. This rules out parallelizing the entire flow without changing the algorithm itself. Also, there are three functions or code segments worth improving, since they are the biggest consumers of CPU time in the program; other segments might be improvable, but probably aren't worth the effort.

With that in mind, we decided to try several optimizations, each independent of the others. Our objectives were to:

1. Parallelize the read/write/process operations and gain a few seconds (this improvement will be close to constant as files grow larger, since I/O is always slower than the CPU).
2. Try to parallelize a segment of the code, or a flow of several functions, instead of the entire program.
3. Introduce SIMD into the code, using these instructions mainly in loops. In addition, attempt to take several bytes of audio at a time from the buffer and calculate the output at once (note: the code has some usages of such features, but with instruction sets that are several years old).
4. Make some implementation improvements. We detected functions that we might be able to unroll (not necessarily with SIMD) and functions (mostly math-based) with potential for significant improvement (explained in detail in section 7).

Of course, following the VTune analysis, our main focus is on the three major functions we found.
4. First Optimization – Parallel I/O/CPU

4.1 Description

Our first attempt was to parallelize the I/O operations (read, write) and the CPU operations. Since the algorithm of the original program processes the file by blocks, and therefore performs a read from the source file and a write for each block, we concluded that we could run the reading and writing operations in parallel and thereby increase the program's performance. Since reading and writing are not the bottleneck of the application, there is no point in reading or writing with more than one thread per operation. With this conclusion, we chose to implement the multithreading idea with two threads (besides the main thread) – one for block processing and one for writing the processed blocks to the file. The reading is done by the main thread.

The threads communicate through two queues of jobs (see Appendix A). The first queue holds blocks waiting to be processed after being read from the file; the second holds processed blocks waiting to be written. Each block read by the main thread is read with the same algorithm as in the original program and then enqueued into the first queue. The worker thread responsible for processing these waiting jobs recognizes that it has a waiting job, pulls it off the queue, and starts the processing stage.

In the original program, writing to the file is done while processing the blocks. To achieve parallelization, we redirected the write destination to a temporary buffer instead of the file. In this way we create processed blocks, which can now be enqueued into the second queue to wait for the writing thread to do its job. The writing thread pulls off the current job and writes the block into the destination file. This thread is also responsible for displaying progress on screen and for freeing all allocated memory. This algorithm continues until all blocks have been read, processed, and written to the destination file.

In figure 4.1 you can see the new program flow resulting from this improvement.

Figure 4.1: detailed program flow after parallelizing CPU & I/O operations.

With this implementation the tasks run simultaneously – we can read new blocks while processing other ones, and the writing tasks can be performed while the other two tasks are still running.
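Schematically, the resulting three-stage pipeline looks like the sketch below. The queue API follows Appendix A; block_t and the read_block/pack_block/free_block helpers are hypothetical stand-ins for the real per-block routines.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-block type and helpers standing in for the real code. */
typedef struct { void *data; size_t size; } block_t;
block_t *read_block(FILE *in);         /* returns NULL at end of input    */
void     pack_block(block_t *blk);     /* WAV -> WV, into a temp buffer   */
void     free_block(block_t *blk);

/* Blocking queue API as described in Appendix A. */
typedef struct blocking_queue blocking_queue;
void  enqueue(blocking_queue *q, void *item);
void *dequeue(blocking_queue *q);      /* NULL once terminated and empty  */
void  terminate_queue(blocking_queue *q);

blocking_queue *to_process, *to_write; /* created at startup */

/* Main thread: read blocks and hand them to the processing thread. */
void reader_loop(FILE *in) {
    block_t *blk;
    while ((blk = read_block(in)) != NULL)
        enqueue(to_process, blk);
    terminate_queue(to_process);       /* no more input */
}

/* Processing thread: compress each block into its temporary buffer. */
void processor_loop(void) {
    block_t *blk;
    while ((blk = dequeue(to_process)) != NULL) {
        pack_block(blk);
        enqueue(to_write, blk);
    }
    terminate_queue(to_write);
}

/* Writing thread: flush finished blocks to the .wv file, in order. */
void writer_loop(FILE *out) {
    block_t *blk;
    while ((blk = dequeue(to_write)) != NULL) {
        fwrite(blk->data, 1, blk->size, out);
        free_block(blk);
    }
}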
4.2 Performance Testing

We ran the VTune analysis with this optimization only. The results were:

Figure 4.2: VTune analysis results.

The main thread (the reader) is second in the table; it takes about 3.83% of the total processing time. The writing thread used less than 1% of the CPU time.

4.3 Conclusions

From the above results we learn that no significant improvement can be achieved here. The reason is that the I/O operations take considerably less time than the block processing, and the main thread (the reader) finishes its work much sooner than the processing thread. In addition, the writing thread is almost never busy, since the time it takes to write the processed blocks to the file is negligible compared to the time it takes to process them; the writing thread therefore did virtually nothing. Looking at the table, the reading took about 3.83% of the total running time. This comes to about 1 second of improvement, and the total speedup we get here is 29.625 / 28.859 = 1.026. Weighing implementation time and program readability against the performance gain, this optimization wasn't worthwhile.

5. Second Optimization – Multi-Threaded Processing

5.1 Description

After parallelizing the I/O operations, we wanted to find other areas to parallelize. The next step was to parallelize the processing stage itself, by creating multiple threads and letting each one execute some section of the code. As said before, the blocks are dependent and must be processed sequentially, so we can't process two blocks in two different threads simultaneously. Once we concluded that we can't parallelize the entire flow, we searched the program for parts that take most of the CPU time and can be parallelized efficiently, again using the Intel® VTune™ Performance Analyzer (see section 3.3).

After identifying the code sections relevant for multithreading, our method was to create, for each thread, a task entry function and a structure of relevant data. The structure is filled dynamically with updated data and holds the information the thread needs to run (see figure 5.1). Some of the fields of this structure contain the data that has to be processed.

// setup thread #1 args
args_0.dpps = wps->decorr_passes;
args_0.terms = wps->num_terms;
args_0.sample_count = sample_count;
args_0.buffer = buffer;
args_0.flags = flags;

// setup thread #2 args (private copies, to avoid sharing)
args_1.dpps = dpps_tmp;
memcpy(args_1.dpps, wps->decorr_passes, sizeof(struct decorr_pass) * MAX_NTERM);
args_1.terms = wps->num_terms;
args_1.sample_count = sample_count;
args_1.buffer = buffer_tmp;
memcpy(args_1.buffer, buffer, INPUT_SAMPLES * sizeof(int32_t*));
args_1.flags = flags;

Figure 5.1: code snippet from the function pack_samples – creating structures for the threads.

The problem with such data is that each thread works on a different part of it but still has to access the same memory addresses. For example, in one section of the code we work on the samples buffer with two threads: one thread runs on the even samples while the other runs on the odd samples. If those two threads ran on the same buffer (the same memory address, with offsets always from the start), then even though they would be running on different data we would still get memory-sharing conflicts (false sharing) and lower performance. To improve performance and create independent jobs for the threads, we create a copy of the whole data set; each thread then works on its own data, with no sharing conflicts between them. Because we created copies of some parts of the data, we have to wait until all the threads have finished their jobs, merge each thread's results back into the original buffer, and free all allocated memory.

To reduce thread-creation overhead, we used a thread pool (see Appendix B). When we want to parallelize a section of the code, we prepare the structures as described above and submit the job to the thread pool (see figure 5.2). An available thread in the pool "takes" the job and starts working. In this way there is no need to create a thread for each job we want to run; the same threads are reused for multiple jobs.

// run the jobs through the thread pool:
// thread 1 will run the even samples
submitWork(tp, run_dpp_0, &args_0, decorrLock);
// thread 2 will run the odd samples
submitWork(tp, run_dpp_1, &args_1, decorrLock);
// the calling thread waits for both to finish
waitForHandle(decorrLock, 2);

Figure 5.2: code snippet from the function pack_samples – submitting jobs to the thread pool and waiting for them to finish.
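The pool jobs themselves are thin wrappers around the split processing function. A sketch of what the even-samples entry point run_dpp_0 might look like follows; the argument structure is modeled on figure 5.1, but its exact layout and the callee's signature are our assumptions, not the project's declarations.

#include <stdint.h>

struct decorr_pass;                       /* WavPack's per-pass state */

/* Argument block handed to a pool thread (cf. figure 5.1); assumed layout. */
typedef struct {
    struct decorr_pass *dpps;             /* (copy of the) pass array       */
    int32_t  *buffer;                     /* (copy of the) sample buffer    */
    int       terms;                      /* number of decorrelation passes */
    uint32_t  sample_count;
    uint32_t  flags;
} dpp_args;

void decorr_stereo_pass_id2_0(struct decorr_pass *dpp,
                              int32_t *buffer, uint32_t sample_count);

/* Entry point for the thread handling the even samples: it runs every
   pass over its own private copies, so no locking is needed meanwhile. */
void run_dpp_0(void *param)
{
    dpp_args *args = (dpp_args *)param;
    int i;

    for (i = 0; i < args->terms; ++i)
        decorr_stereo_pass_id2_0(&args->dpps[i],
                                 args->buffer, args->sample_count);
}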
In figure 5.3 you can see the new program flow after the second optimization:

Figure 5.3: detailed program flow after parallelizing specific code sections.

5.2 Performance Testing

We ran the VTune analysis with this optimization only. The results were:

Figure 5.4: VTune analysis results.

Figure 5.4 shows all the threads and their running analysis. The first thread is the main thread; the other two are threads from the thread pool, triggered by the main thread to do the multithreaded jobs. We can see (figures 5.6, 5.7) that the pool threads were working on the function 'decorr_stereo_pass_id2' (we split this function into two separate indexed functions).

Figure 5.5: running results of the main thread (thread ID 4680).
Figure 5.6: running results of the first pool thread (thread ID 4276).
Figure 5.7: running results of the second pool thread (thread ID 4272).

5.3 Conclusions

Compared to the original results, we can see (figures 5.6, 5.7) that each thread runs for about half of the original running time of 'decorr_stereo_pass_id2' (with a single thread). Since the original measured time was about 9.77 seconds (33% of the total time of 29.625 seconds), we could expect an improvement of 9.77 / 2 = 4.885 seconds. The new measured time is 25.375 seconds, which is 4.25 seconds less than the original time; considering the multithreading overhead, the results match the expectations.

In summary, the total speedup we achieved here is 29.625 / 25.375 = 1.167, about a 16% improvement in performance.

6. Third Optimization – Moving to SIMD

6.1 Description

One of the advantages of the Intel® Core™ i7 processor is a set of instructions (SIMD – see Appendix C) that operate on 128 bits of data. To use these instructions efficiently, we searched the program for mathematical code sections with many calculations and repetitions, such as loops. The idea was to convert four 32-bit operations into one 128-bit operation; theoretically, each such section can become up to four times faster.

The WavPack program has several code sections with mathematical calculations inside "for" loops that repeat tens of thousands of times, so these were the right places to try SIMD instead of the original implementation. Since the original data size in each step was 32 bits, and a 128-bit instruction operates on four 32-bit elements, we had to find a way to create four independent calculations. To do that we used loop unrolling, so each loop step now computes four original steps. Inside the loop, we load the relevant data into 128-bit registers and implement the same calculations the original program does; after the data is processed, we store it from the registers back to the buffer. In between, we use several functions that operate on the 128-bit registers. In the following example (figure 6.1), we used SSE2 instructions that operate on 128 bits.

__m128i sam1, sam2, tmp;
...
// this code uses SSE2 instructions;
// in addition, we added loop unrolling
for (bptr = buffer; bptr < tmp_eptr; bptr += 16) {
    // set all initial integers for the calculations, including two loops
    sam1 = _mm_set_epi32(bptr[4], bptr[2], bptr[0], dpp->samples_A[0]);
    tmp  = _mm_set_epi32(bptr[2], bptr[0], dpp->samples_A[0], dpp->samples_A[1]);
    sam2 = _mm_set_epi32(bptr[12], bptr[10], bptr[8], bptr[6]);

    sam1 = _mm_slli_epi32(sam1, 1);   // multiply by 2
    sam1 = _mm_sub_epi32(sam1, tmp);  // subtract tmp from 2*sam1

    // set integers for the second loop
    tmp  = _mm_set_epi32(bptr[10], bptr[8], bptr[6], bptr[4]);
    sam2 = _mm_slli_epi32(sam2, 1);   // multiply by 2
    sam2 = _mm_sub_epi32(sam2, tmp);  // subtract tmp from 2*sam2
    ...
}

Figure 6.1: code snippet from the function decorr_stereo_pass_id2_0.

Each variable contains four independent 32-bit elements, and the calculations are performed on all of the elements simultaneously. In addition, we unrolled the loop once more, so each iteration processes eight steps; and since this specific example processes the even indexes of the buffer, we increment the buffer pointer by 16.

Because of the loop unrolling, and because we don't know the buffer's size in advance, we have to handle the part of the buffer that wasn't processed (the remainder modulo 16). For that, we added another "for" loop after the described one, containing the original scalar code, which processes the last part of the buffer.
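The general pattern – a vectorized, unrolled main loop plus a scalar remainder loop – looks like this on a simpler computation (a stand-alone illustrative example, not WavPack code):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* out[i] = 2*a[i] - b[i]: four elements per step, with a scalar tail
   for the remainder that doesn't fill a whole 128-bit vector. */
void twice_minus(int32_t *out, const int32_t *a, const int32_t *b, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        va = _mm_slli_epi32(va, 1);                        /* 2*a */
        _mm_storeu_si128((__m128i *)(out + i), _mm_sub_epi32(va, vb));
    }
    for (; i < n; ++i)            /* scalar tail: the n % 4 leftovers */
        out[i] = 2 * a[i] - b[i];
}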
6.2 Performance Testing

Note: we implemented this optimization on top of the multithreaded code. To check it independently of the other optimizations, we ran the work of the two threads serially. The analysis result:

Figure 6.2: VTune analysis with the SIMD optimizations.

The improvement is in the function 'decorr_stereo_pass_id2': in the original program it took 33% of the total time, and in this run about 30%. The number of retired instructions is about 33,088,500,000. The total runtime is 28.39 seconds.

6.3 Conclusions

This optimization saved 3% of the total runtime, bringing it to 28.39 seconds instead of the original 29.625. We also see a reduction of 4,698,000,000 retired instructions, probably a result of the SIMD instructions and the loop unrolling, and mostly the reason for the improvement. The speedup we got here is 29.625 / 28.39 = 1.043; on its own, this optimization does not improve performance significantly.

7. Fourth Optimization – Implementation Improvements

7.1 Description

After applying the optimizations discussed above, we still thought more speed could be gained. We decided to go over the hot spots in the code and reimplement parts of it more efficiently. The changes were guided by VTune, which located several places in the code with heavy mathematical calculations or high branch-prediction misses. In this document, we chose to include only the functions that were critical in terms of improvement.

According to VTune, we were still getting hot spots in the 'flush_word' function, and specifically in the macros it uses (a code snippet is shown in figure 7.1).

#define putbit_1(bs) { (bs)->sr |= (1 << (bs)->bc); \
    if (++((bs)->bc) == sizeof (*((bs)->ptr)) * 8) { \
        *((bs)->ptr) = (bs)->sr; \
        (bs)->sr = (bs)->bc = 0; \
        if (++((bs)->ptr) == (bs)->end) (bs)->wrap (bs); \
    }}

#define putbits(value, nbits, bs) { \
    (bs)->sr |= (int32_t)(value) << (bs)->bc; \
    if (((bs)->bc += (nbits)) >= sizeof (*((bs)->ptr)) * 8) \
        do { \
            *((bs)->ptr) = (bs)->sr; \
            (bs)->sr >>= sizeof (*((bs)->ptr)) * 8; \
            if (((bs)->bc -= sizeof (*((bs)->ptr)) * 8) > 32 - sizeof (*((bs)->ptr)) * 8) \
                (bs)->sr |= ((value) >> ((nbits) - (bs)->bc)); \
            if (++((bs)->ptr) == (bs)->end) (bs)->wrap (bs); \
        } while ((bs)->bc >= sizeof (*((bs)->ptr)) * 8); \
    }

Figure 7.1: the putbit_1 and putbits macros.

'flush_word' puts bits into the output buffer: it collects '1's and '0's in a temporary int32 variable, and after 16 bits have been set, it sends the bits to the output buffer. The macros it uses are putbit, putbits, putbit_0, and putbit_1. Since these macros contain a large number of branches that depend on the input file, reducing these branches should clearly give a performance boost. We finally implemented two changes: the first was reimplementing the macros to use no branches, and the second was using an int64 variable so that the bits are written out in larger blocks.

#define putbit_1_opt(bs) { \
    (bs)->num |= (uint64_t)1 << (bs)->bc; \
    ++((bs)->bc); \
}

#define putbits_opt(value, nbits, bs) { \
    (bs)->num |= (uint64_t)(value) << (bs)->bc; \
    (bs)->bc += nbits; \
}

Figure 7.2: code snippet – the putbit_1 and putbits macros after optimization.

7.2 Performance Testing

One of the main reasons for implementing this optimization is branch misprediction. To show the difference between the original and the optimized program, we first ran a branch-misprediction analysis on the original program:

Figure 7.3: original program – VTune analysis of branch misprediction.

In the function 'flush_word' there are about 78,420,338 mispredicted branches out of a total of 1,713,906,754 branch instructions. Now let's run the optimized program and check its results:

Figure 7.4: optimized program – VTune analysis of branch misprediction.

We can see a drastic reduction in branch mispredictions in the optimized function 'flush_word_opt' – about a fifth of the original (15,435,072). The total number of branch instructions is also lower than in the original program – about 400 million events fewer. Another result of this optimization is that the function 'send_words_lossless' has more branch events – about 800 million more – but more than half of the branches mispredicted in the original program are now predicted correctly. The use of 64-bit data elements instead of 32-bit ones also improves performance. All these improvements add up to a total runtime of 27.937 seconds.

7.3 Conclusions

As a result of the reduction in branches and branch mispredictions, and of using a 64-bit integer instead of a 32-bit one, we see an improvement in performance – almost 2 seconds less than the original program. Seemingly we could have used 128 bits with SIMD instructions, but then the use of the 128-bit registers would cause too much overhead and we wouldn't achieve any speedup. For that reason, reimplementing with a 64-bit integer, which maps to a 64-bit register, was the best choice here. The total speedup we got here is 29.625 / 27.937 = 1.06.
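For completeness, the other half of the change – draining the 64-bit accumulator into the 16-bit output buffer – might look roughly like this. The num/bc fields follow figure 7.2 and the ptr/end/wrap fields follow figure 7.1, but the structure as a whole and the function are our sketch, not the project's code.

#include <stdint.h>

typedef struct bitstream {
    uint64_t  num;                          /* pending bits (figure 7.2)     */
    uint32_t  bc;                           /* how many of them are valid    */
    uint16_t *ptr, *end;                    /* output position and its limit */
    void    (*wrap)(struct bitstream *bs);  /* flush/advance the buffer      */
} Bitstream;

/* Emit whole 16-bit words from the accumulator; the (at most 15)
   leftover bits simply stay in 'num' for the next call. */
static void flush_accumulator(Bitstream *bs)
{
    while (bs->bc >= 16) {
        *bs->ptr++ = (uint16_t)bs->num;     /* low 16 bits out */
        bs->num >>= 16;
        bs->bc  -= 16;
        if (bs->ptr == bs->end)
            bs->wrap(bs);
    }
}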
8. Optimization Summary

Here we can see all the threads that run in the fully optimized application:

Figure 8.1: VTune analysis results of all optimizations together.

1. Main thread (thread ID 4024).
2. "Pool thread" (thread ID 3900) – works on the left channel, as described in section 5.
3. "Pool thread" (thread ID 7124) – works on the right channel, as described in section 5.
4. Reading thread (thread ID 6596) – reads the blocks, as described in section 4.
5. Writing thread (thread ID 2984) – writes the blocks, as described in section 4.

We can also see that each thread runs on a different core, which ensures that the multithreading is used most efficiently.

Figure 8.2: VTune analysis results of all optimizations together, threads with CPU info.

Figure 8.3 shows each optimization's speedup and the total speedup. The most significant optimization was the multithreading of code sections, with a 16% speedup, while the least significant was the multithreaded I/O, with a 2.6% speedup.

Figure 8.3: optimization steps vs. speedup.

The total speedup we achieved is 29.625 / 22.187 = 1.335, meaning the program runs 33.5% faster.

9. Appendix A – Blocking Queue

The blocking queue is an ADT that blocks incoming calls: dequeue blocks until there is an item in the queue to return, and enqueue blocks when there is no more room. It is initialized with a size parameter and contains an array of items of that size, as well as semaphores that maintain the blocking. When dequeue is invoked, the queue blocks the calling thread until it has an item to return. When enqueue is invoked, the queue adds the item, unless the queue is full, in which case the calling thread is blocked until the queue has room for the incoming job.

The queue also provides an additional "terminate" feature, which lets the queue know that there are no more incoming items. From then on, if the queue is empty when a dequeue arrives, it does not block the caller but returns "terminated". Calling delete_queue releases the queue's resources.

10. Appendix B – Thread Pool

The thread pool is a singleton ADT. It is created with a constant number of threads and contains a blocking queue for incoming jobs. All the pool's threads start in a loop, waiting on the queue's dequeue method for an incoming job. To submit a job to the pool, the current thread creates a "thread job" which stores the function pointer of what to do, the arguments, and a mutex object to synchronize on. Once the job has been submitted, the next thread waiting on the queue is released from the queue's lock and receives the job. The thread runs the job and, when done, releases the mutex object, frees the job's resources, and returns to wait on the queue for another job. If all the threads are busy, jobs wait in the queue. When the application ends, it calls the pool's delete method to close the pool's threads and release any remaining resources.
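A minimal Win32 sketch of these two ADTs, assuming the semantics described above, follows; the names, field layout, and the NULL-job shutdown convention are ours, not the project's.

#include <windows.h>
#include <stdlib.h>

/* ---- Appendix A: bounded blocking queue (sketch) ---- */
typedef struct {
    void **items;
    int capacity, head, tail, count, terminated;
    HANDLE slots;                 /* free slots: blocks enqueue when full    */
    HANDLE avail;                 /* queued items: blocks dequeue when empty */
    CRITICAL_SECTION lock;
} blocking_queue;

blocking_queue *create_queue(int capacity) {
    blocking_queue *q = calloc(1, sizeof *q);
    q->items = calloc(capacity, sizeof *q->items);
    q->capacity = capacity;
    q->slots = CreateSemaphore(NULL, capacity, capacity, NULL);
    q->avail = CreateSemaphore(NULL, 0, capacity + 1, NULL); /* +1: terminate */
    InitializeCriticalSection(&q->lock);
    return q;
}

void enqueue(blocking_queue *q, void *item) {
    WaitForSingleObject(q->slots, INFINITE);       /* block while full */
    EnterCriticalSection(&q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % q->capacity;
    q->count++;
    LeaveCriticalSection(&q->lock);
    ReleaseSemaphore(q->avail, 1, NULL);           /* wake one consumer */
}

/* Returns NULL ("terminated") once the queue is flagged and drained. */
void *dequeue(blocking_queue *q) {
    void *item;
    WaitForSingleObject(q->avail, INFINITE);       /* block while empty */
    EnterCriticalSection(&q->lock);
    if (q->count == 0 && q->terminated) {
        LeaveCriticalSection(&q->lock);
        ReleaseSemaphore(q->avail, 1, NULL);       /* pass the wake-up along */
        return NULL;
    }
    item = q->items[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    LeaveCriticalSection(&q->lock);
    ReleaseSemaphore(q->slots, 1, NULL);
    return item;
}

void terminate_queue(blocking_queue *q) {
    EnterCriticalSection(&q->lock);
    q->terminated = 1;
    LeaveCriticalSection(&q->lock);
    ReleaseSemaphore(q->avail, 1, NULL);           /* wake a blocked consumer */
}

/* ---- Appendix B: pool worker loop (sketch) ---- */
typedef struct {
    void (*fn)(void *);           /* what to do    */
    void *arg;                    /* its arguments */
    HANDLE done;                  /* event the submitter waits on */
} thread_job;

static DWORD WINAPI pool_worker(LPVOID param) {
    blocking_queue *jobs = param;
    thread_job *job;
    while ((job = dequeue(jobs)) != NULL) {        /* NULL = pool shutting down */
        job->fn(job->arg);
        SetEvent(job->done);                       /* release the waiting submitter */
        free(job);
    }
    return 0;
}
/* Pool startup: CreateThread(NULL, 0, pool_worker, jobs, 0, NULL) per thread. */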
11. Appendix C – SIMD

SIMD stands for Single Instruction, Multiple Data: one instruction performs an operation on N data elements at once. In this project we used an Intel Core i7 processor, which supports 128-bit registers. With these registers, we can perform four operations on 32-bit elements, or two operations on 64-bit elements, simultaneously. This can be significant when trying to achieve a performance speedup. In our project, since most of our SIMD improvements were to mathematical calculations, we mostly used SSE and SSE2 instructions. These instructions are used through "intrinsics" – wrappers for SIMD instructions implemented in Visual C++.
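As a flavor of what these intrinsics look like in practice, here is a minimal, self-contained SSE2 example (illustrative only, not project code):

#include <emmintrin.h>   /* SSE2 intrinsics in Visual C++ */
#include <stdint.h>

/* Add four pairs of 32-bit integers with one SIMD instruction. */
void add4(int32_t dst[4], const int32_t a[4], const int32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 128 bits  */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vs = _mm_add_epi32(va, vb);                /* 4 adds at once */
    _mm_storeu_si128((__m128i *)dst, vs);              /* store 128 bits */
}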