WavPack

Performance Tuning Project
Technion Softlab
Submitted By:
Eyal Segal
Koren Shoval
Supervisors:
Liat Atsmon
Koby Gottlieb
Spring 2009
Table of Contents

1. Introduction
   1.1 Introduction
   1.2 About WavPack
   1.3 Project goals
2. WavPack
   2.1 Wave file format
   2.2 WavPack file format
      2.2.1 Description
      2.2.2 Block header
      2.2.3 Metadata sub-blocks
      2.2.4 Metadata tags
   2.3 Algorithm and Program flow
3. Benchmark
   3.1 Testing Environment
      3.1.1 Hardware
      3.1.2 Software
   3.2 Testing Case in WavPack
   3.3 Running WavPack with VTune
   3.4 Original Version Performance
   3.5 Conclusions and Objectives
4. First Optimization - Parallel IO/CPU
   4.1 Description
   4.2 Performance Testing
   4.3 Conclusions
5. Second Optimization - Multi Threaded Processing
   5.1 Description
   5.2 Performance Testing
   5.3 Conclusions
6. Third Optimization - Moving to SIMD
   6.1 Description
   6.2 Performance Testing
   6.3 Conclusions
7. Fourth Optimization - Implementation Improvements
   7.1 Description
   7.2 Performance Testing
   7.3 Conclusions
8. Optimization Summary
9. Appendix A – Blocking Queue
10. Appendix B – Thread Pool
11. Appendix C – SIMD
1. Introduction
1.1 Introduction
Many open source applications are single-threaded and not yet optimized for modern
multi-core processors. Performance Tuning projects attempt to improve such applications by
introducing multithreading and by rewriting code using new SIMD instructions.
This project deals with the performance tuning of WavPack, an open source, lossless
encoder that converts wave audio files to WV files. WavPack fits this profile: it is
single-threaded, written in C, and incorporates only some MMX instructions dating back
several years.
1.2 About WavPack
WavPack is a completely open audio compression format providing lossless, high-quality lossy,
and a unique hybrid compression mode. WavPack compresses WAV files into WV files.
In the default lossless mode, WavPack acts just like a WinZip compressor for audio files.
However, unlike MP3 or WMA encoding which can affect the sound quality, not a single bit of
the original information is lost, so there's no chance of degradation. This makes lossless mode
ideal for archiving audio material or any other situation where quality is paramount. The
compression ratio depends on the source material, but generally is between 30% and 70%.
The hybrid mode provides all the advantages of lossless compression with an additional bonus.
Instead of creating a single file, this mode creates both a relatively small, high-quality lossy file
that can be used all by itself, and a "correction" file that (when combined with the lossy file)
provides full lossless restoration.
WavPack is supported by some well-known Windows software, and is also making its way
into Linux and Mac territory. For example, there are WavPack plug-ins for Winamp, Nero, and
more. To play WavPack files in Windows Media Player, for example, it is enough to install
the ffdshow filter.
In addition to software support, WavPack is also supported by hardware devices, such as
mobile phones, portable music players, and more. For example, all new Nokia mobile phones
running the Symbian S60 3rd Edition OS can play WavPack files. Among music players, devices
like the Cowon A3 PMP, iRiver, and iPod support WavPack files.
1.3 Project goals
 The main goal of this project is to enhance the WavPack application's performance, in order
to achieve a good speedup compared to the original application. Of course, the output
must stay identical to the original's.
 Work with the Intel® VTune™ Performance Analyzer to find potential
spots for performance enhancement.
 Learn and use the new instructions of Intel's new Core i7 processor.
 Implement multithreading techniques in order to achieve high performance.
 After the project is completed, return the improved WavPack application to the open
source community.
2. WavPack
2.1 Wave file format
WAV (or WAVE), short for Waveform audio format, also known as Audio for Windows, is a
Microsoft and IBM audio file format standard for storing an audio bitstream on PCs. It is an
application of the RIFF bitstream format method for storing data in “chunks".
Though a WAV file can hold compressed audio, the most common WAV format contains
uncompressed audio in the linear pulse code modulation (LPCM) format.
A RIFF file starts out with a file header followed by a sequence of data chunks. The general
structure of a RIFF file is a list of chunks. A chunk is either a list or data: a list node
contains an id, the size of the following sub-chunks, and its type. In the wave format shown
in figure 2.1, the RIFF header is a list chunk, the second chunk is a data chunk specifying
the data format, and the third is a data chunk containing the actual sample data.
In the wave format, each chunk size is even, which means that 1 byte of padding is added if
the length is odd. There may be additional sub-chunks in a Wave file, and even more than one
wave data chunk.
Figure 2.1: wave file format
Riff chunk – header of the WAVE format:
ChunkID – Contains the letters "RIFF" in ASCII form (0x52494646 big-endian form).
ChunkSize – The size of the entire file in bytes, minus 8 bytes for the two fields not
included in this count (the Riff header's ChunkID and ChunkSize); i.e. 36 + Subchunk2Size.
Format – Contains the letters "WAVE" (0x57415645 big-endian form).
Format sub-chunk – describes the sound data's format:
Subchunk1ID – Contains the letters "fmt " (0x666d7420 big-endian form).
Subchunk1Size – 16 for PCM. This is the size of the rest of the sub-chunk which follows this number.
AudioFormat – PCM = 1 (i.e. linear quantization). Values other than 1 indicate some form of compression.
NumChannels – Mono = 1, Stereo = 2, etc.
SampleRate – 8000 Hz, 44100 Hz, etc.
ByteRate – Average byte rate (generally SampleRate × NumChannels × BitsPerSample / 8).
BlockAlign – NumChannels × BitsPerSample / 8.
BitsPerSample – 8 bits, 16 bits, etc.

Data sub-chunk – contains the size of the data and the actual sound:
Subchunk2ID – Contains the letters "data" (0x64617461 big-endian form).
Subchunk2Size – The number of bytes in the data, i.e. the size of the data block following
this number (generally NumSamples × NumChannels × BitsPerSample / 8).
Data – The actual sound data.
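For illustration, the canonical 44-byte PCM header that these three tables describe can be
written as a single C struct (a sketch; the struct name is ours, and #pragma pack is assumed
to be available, as it is in Visual C++ and GCC):

#include <stdint.h>

#pragma pack(push, 1)          /* no padding: the header is exactly 44 bytes */
typedef struct {
    char     ChunkID[4];       /* "RIFF" */
    uint32_t ChunkSize;        /* 36 + Subchunk2Size */
    char     Format[4];        /* "WAVE" */
    char     Subchunk1ID[4];   /* "fmt " */
    uint32_t Subchunk1Size;    /* 16 for PCM */
    uint16_t AudioFormat;      /* 1 = PCM (linear quantization) */
    uint16_t NumChannels;      /* 1 = mono, 2 = stereo */
    uint32_t SampleRate;       /* e.g. 44100 */
    uint32_t ByteRate;         /* SampleRate * NumChannels * BitsPerSample / 8 */
    uint16_t BlockAlign;       /* NumChannels * BitsPerSample / 8 */
    uint16_t BitsPerSample;    /* e.g. 16 */
    char     Subchunk2ID[4];   /* "data" */
    uint32_t Subchunk2Size;    /* NumSamples * NumChannels * BitsPerSample / 8 */
} WavHeader;
#pragma pack(pop)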
2.2 WavPack file format
2.2.1 Description
A WavPack file consists of a series of WavPack audio blocks. Every block contains "metadata"
– information about the sound data, including sampling rate, channels, bits per sample, and
more. Metadata may also contain various coefficients used for restoring samples, the
correction bitstream, and the actual compressed samples.
WavPack blocks are very easy to identify by their unique header data, which contains,
among other things, information about the total size of the block and the audio format that
is stored.
These blocks are completely independent in that they can be decoded all by themselves.
They may contain up to 131072 samples, either stereo or mono, and can be lossless or
lossy.
An additional format is the correction file (.wvc) that has an identical structure to the main
file. There is a one-to-one correspondence between main file blocks that contain audio and
their correction file match. The only difference is in the headers of the blocks – the block’s
size and the CRC value.
In order to allow reduced memory requirements (mostly for hardware devices), it is
possible to decode regular Wavpack files without buffering an entire block.
2.2.2 Block header
Here is the 32-byte little-endian header found at the front of every WavPack block:
Size      Name           Description
4 bytes   ckID           Block id ('wvpk')
32 bits   ckSize         Total block size (not including this field or 'wvpk')
16 bits   version        Current valid versions are 0x402 - 0x410
8 bits    track_no       Track number (not currently implemented)
8 bits    index_no       Track sub index (not currently implemented)
32 bits   total_samples  Total samples in entire file (valid if block_index = 0; a value of -1 indicates unknown length)
32 bits   block_index    Index of first sample in block relative to beginning of file
32 bits   block_samples  Samples in this block (0 means no audio present)
32 bits   flags          Various flags for id and decoding
32 bits   crc            CRC for actual decoded data
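The same layout can be expressed as a C struct (a sketch that mirrors the table above;
WavPack's own sources define an equivalent header struct, and all multi-byte fields are
little-endian on disk):

#include <stdint.h>

#pragma pack(push, 1)          /* exactly 32 bytes, little-endian on disk */
typedef struct {
    char     ckID[4];          /* "wvpk" */
    uint32_t ckSize;           /* block size, not counting ckID and ckSize */
    uint16_t version;          /* 0x402 - 0x410 */
    uint8_t  track_no;         /* not currently implemented */
    uint8_t  index_no;         /* not currently implemented */
    uint32_t total_samples;    /* valid if block_index == 0; -1 = unknown */
    uint32_t block_index;      /* index of first sample in the block */
    uint32_t block_samples;    /* samples in this block; 0 = no audio */
    uint32_t flags;            /* id and decoding flags */
    uint32_t crc;              /* CRC of the decoded data */
} WavpackBlockHeader;
#pragma pack(pop)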
2.2.3 Metadata sub-blocks
Following the 32-byte header to the end of the block are a series of “metadata” sub-blocks.
These sub-blocks contain extra information needed to decode the audio, but may also
contain user information that is not required for decoding. The final sub-block is usually the
compressed audio bitstream itself.
The format of the metadata is:
id (8 bits) – 4 masks available: 0x1f – metadata function, 0x20 – decoder doesn't need to
understand the metadata, 0x40 – actual data length is 1 byte less than indicated (odd size),
0x80 – large block (> 255 words).
word_size (8 bits) / word_size[3] (24 bits) – data size in words (small block / large block).
data[word_size] – the data itself (16-bit words), padded to an even number of bytes.
The most relevant metadata ids available:
ID_DUMMY – could be used to pad WavPack blocks
ID_WV_BITSTREAM – normal compressed audio bitstream (wv file)
ID_WVC_BITSTREAM – correction file bitstream (wvc file)
ID_WVX_BITSTREAM – special extended bitstream for floating point data or long integers (> 24 bits)
ID_RIFF_HEADER – RIFF header for .wav files (before audio)
ID_RIFF_TRAILER – RIFF trailer for .wav files (after audio)
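A short sketch of parsing a sub-block header according to this layout (illustrative code,
not WavPack's own; the mask names match the id table above):

#include <stdint.h>

#define ID_LARGE    0x80   /* large block: 24-bit word count follows */
#define ID_ODD_SIZE 0x40   /* actual data length is 1 byte less */

/* Reads a metadata sub-block header from 'p', returns a pointer to the
 * data and stores the id and the data size in bytes. */
static const uint8_t *read_meta_header(const uint8_t *p,
                                       int *id, uint32_t *bytes)
{
    uint32_t words;

    *id = *p++;
    if (*id & ID_LARGE) {              /* 24-bit size, little-endian */
        words = p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
        p += 3;
    } else {                           /* 8-bit size */
        words = *p++;
    }

    *bytes = words * 2;                /* sizes are counted in 16-bit words */
    if (*id & ID_ODD_SIZE)
        (*bytes)--;                    /* odd length: last byte is padding */

    return p;                          /* 'p' now points at the data */
}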
2.2.4 Metadata tags
These tags are special tags for storing user data such as artist, title, album, track, year, etc.
2.3 Algorithm and Program flow
WavPack is an open audio compression algorithm and an open source software
implementation that supports three compression modes – lossless, lossy, and a unique hybrid
compression mode.
The project’s focus is on the lossless stereo mode in which the audio samples are simply
compressed at their full resolution and no information is discarded along the way.
The basic algorithm has three main parts:
1. Joint stereo processing, which removes inter-channel correlations.
2. Multipass decorrelation, which removes intra-channel correlations between
neighboring audio samples.
3. An entropy encoder used to compress the data.
The input stream is partitioned into blocks that can be either mono or stereo and are about 0.5
seconds long. For each of these blocks, the first step is to convert the left and right channels
into difference and average (also referred to as side and mid).
The second step is prediction. This is where multiple passes are made over each block, using
a set of filters and an adaptive LMS algorithm. WavPack allows between 2 and 16 passes
(the default is five).
Finally, the weight is updated for the next sample based on the signs of the filter input and the
residual.
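The sign-sign update can be sketched in C as follows (a simplified illustration of the idea;
WavPack's actual weight-update macro also handles zero inputs and, in some variants, clamps
the weight, and 'delta' is the adaptation step):

#include <stdint.h>

/* Sign-sign LMS sketch: move the weight by a fixed step 'delta' in the
 * direction given by the signs of the filter input and the residual. */
static int32_t update_weight_sketch(int32_t weight, int32_t delta,
                                    int32_t input, int32_t residual)
{
    if (input && residual) {
        if ((input ^ residual) < 0)    /* opposite signs */
            weight -= delta;
        else                           /* same signs */
            weight += delta;
    }
    return weight;
}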
In the lossless mode, the results of the decorrelation (the residuals or weights) of all the
passes are passed to the entropy coder for exact translation. The entropy coder uses
variations on the standard Elias and Golomb algorithms to produce the compressed lossless
output.
The implementation closely follows this description of the algorithm. It goes through the
samples performing the first step in large blocks (i.e. computing side/mid and handling
different sample sizes). It then iterates over the 0.5-second blocks of 24,000 samples
(at 48,000 samples per second in a stereo file). It performs the second step (i.e. the
multiple passes) and finally compresses the resulting blocks. The weights are passed down
to the functions via the global context, which contains the bit stream as well as the
additional information for each pass.
Figure 2.2: a diagram describing the flow of the original implementation.
3. Benchmark
3.1 Testing Environment
3.1.1 Hardware
The project was developed and tested on a Core i7 2.66 GHz CPU and a Core 2 Quad Q6600
2.4 GHz CPU, both with 4 GB of RAM.
3.1.2 Software
Windows XP/Vista with Visual Studio 2008 for development and debugging, and the Intel
VTune toolkit for performance testing, memory leak detection, and thread checks.
The project is compiled with the Microsoft compiler.
3.2 Testing Case in WavPack
The WavPack program supports many different decoding and encoding options. We decided to
improve one mode – the lossless stereo mode. In order to test this mode, we used a 330 MB
WAV file with two channels (stereo).
Note: WavPack can process a 330 MB WAV file in about 30 seconds.
3.3 Running WavPack with VTune
VTune is Intel software designed to measure CPU behavior while running a specified
process. It can integrate with a Visual Studio solution and display hot spots on methods in
the code, according to CPU load.
For our purposes, we mostly used these measures: CPI, cache misses, and branch mispredictions.
3.4 Original Version Performance
As noted above, processing a 330MB file takes about 30 seconds.
In Figure 3.1 you can see the results of VTune analysis on the original program – notice that
there are three major functions which consume most of the CPU time:
Figure 3.1: VTune hotspots analysis results.
‘decorr_stereo_pass_id2’ – a mathematical function that processes WAV blocks into WV
blocks.
‘send_words_lossless’ – a significant part of block processing; it mainly contains logic
that prepares data for file writing.
‘flush_word’ – a function for writing data (bits) to the file.
Figure 3.2: VTune analysis call graph.
In addition, in figure 3.2 you can visually see the functions that consume the majority of
the CPU time (marked in red). Notice that the three functions mentioned above are called for
each block (the function ‘flush_word’ is mostly called from within the function
‘send_words_lossless’).
3.5 Conclusions and Objectives
At first glance, the WavPack encoding algorithm is made up of several steps that split the
data into many elements and process each of them separately; unfortunately, each "small"
element of data depends on the previous one via the WavPack context. This rules out
parallelizing the entire flow without changing the algorithm itself.
Also, there are three functions or code segments worth improving, since they consume most of
the CPU time in the program. Other segments in the code might be improvable, but probably
aren't worth the effort.
With that in mind, we decided to try several optimizations, each independent of the others.
Generally, our objectives were to:
1. Parallelize the read/write/process operations and gain a few seconds (this improvement
will stay close to constant as files grow larger, since I/O is always slower than the CPU).
2. Try to parallelize a segment of the code or a flow of several functions instead of the entire
program.
3. Introduce SIMD into the code, using these instructions mainly in loops. In addition,
attempt to take several bytes of "audio" at a time from the buffer and calculate the output
at once (note: the code already makes some use of such features, but with instruction sets
that are several years old).
4. Try to make some implementation improvements. We detected some functions that we
might be able to unroll (not necessarily with SIMD) and functions (mostly math-based) with
potential for significant improvement (this is explained in detail in section 7).
Of course, according to the VTune analysis, our main focus will be on the three major functions
we found.
4. First Optimization - Parallel IO/CPU
4.1 Description
Our first attempt was to parallelize I/O operations (read, write) and CPU operations.
Since the algorithm of the original program processes the file by blocks, and therefore
performs a read operation from the source file and a write operation for each block, we
concluded that we could run the reading/writing operations in parallel and thereby increase
the program's performance.
Since the reading/writing operations are not the bottleneck of the application, there is no
point in reading or writing with more than one thread per operation.
With this conclusion, we chose to implement the multithreading idea with two threads (besides
the main thread) – one for block processing and one for writing the processed blocks to the
file. The reading is done by the main thread.
The communication between the threads is done using two queues of jobs (see Appendix A).
The first queue holds blocks waiting to be processed after they have been read from the file.
The second queue holds processed blocks waiting to be written.
The main thread reads each block with the same algorithm as the original, and then enqueues
it into the first queue. The worker thread, responsible for processing these waiting jobs,
recognizes that it has a waiting job, pulls it off the queue, and starts the processing
stage. In the original program, writing to the file is done while processing the blocks. In
order to achieve parallelization, we changed the destination of the write to a temporary
buffer instead of the file. In this way, we produce processed blocks that can be enqueued
into the second queue to wait for the writing thread. The writing thread pulls off the
current job and starts working, meaning it writes the block into the destination file. This
thread is also responsible for displaying progress on screen and for freeing all allocated
memory.
This continues until all blocks have been read, processed, and written into the
destination file.
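A condensed sketch of this three-stage pipeline is shown below; the queue and block helpers
are illustrative names built on the blocking queue of Appendix A, not WavPack's actual API:

#include <stddef.h>

typedef struct block block_t;          /* one audio block */
typedef struct queue queue_t;          /* blocking queue (Appendix A) */

extern block_t *read_next_block(void); /* NULL at end of input */
extern void     pack_block(block_t *); /* encode into a temporary buffer */
extern void     write_block_to_file(block_t *);
extern void     free_block(block_t *);
extern void     enqueue(queue_t *, block_t *);
extern block_t *dequeue(queue_t *);    /* NULL once terminated and empty */
extern void     terminate_queue(queue_t *);

static queue_t *to_process, *to_write; /* created at startup */

void reader_main(void)                 /* runs on the main thread */
{
    block_t *blk;
    while ((blk = read_next_block()) != NULL)
        enqueue(to_process, blk);      /* hand the raw block to the worker */
    terminate_queue(to_process);       /* no more input */
}

void processor_thread(void *arg)
{
    block_t *blk;
    while ((blk = dequeue(to_process)) != NULL) {
        pack_block(blk);               /* process into a temporary buffer */
        enqueue(to_write, blk);        /* hand the packed block over */
    }
    terminate_queue(to_write);
}

void writer_thread(void *arg)
{
    block_t *blk;
    while ((blk = dequeue(to_write)) != NULL) {
        write_block_to_file(blk);      /* flush the buffer to disk */
        free_block(blk);               /* the writer also frees memory */
    }
}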
In figure 4.1 you can see the new program flow as a result of our improvement.
Figure 4.1: detailed program flow after parallelizing CPU & I/O operations.
With this implementation, the tasks are running simultaneously – we can read new blocks
while processing other ones. The writing tasks can also be performed while the other two tasks
are still running.
4.2 Performance Testing
We ran VTune analysis with this optimization only. The results were:
Figure 4.2: VTune analysis results.
The main thread (reader) is the second in the table. We can see that it took about 3.83% of
the total processing time. The writing thread used less than 1% of the CPU time.
4.3 Conclusions
From the above results, we learn that no significant improvement can be achieved here.
The reason is that the I/O operations take considerably less time than the block processing,
and the main thread (reader) finishes its work long before the processing thread.
In addition, the writing thread is almost never busy, since the time it takes to write the
processed blocks to the file is negligible compared to the time it takes to process these
blocks. Therefore, we can see that the writing thread did virtually nothing.
If we look at the table of results, we can see that the reading took about 3.83% of the total
running time. This percentage amounts to about 1 second of improvement, and the total speedup
we get here is 29.625 / 28.859 = 1.026.
Considering the implementation time and program readability vs. performance improvement,
this optimization wasn’t worthwhile.
5. Second Optimization - Multi Threaded Processing
5.1 Description
After parallelizing the I/O operations, we wanted to find other areas to parallelize. The
next step was to parallelize the processing stage by creating multiple threads and letting
each one execute some section of the code. As said before, the blocks are dependent and must
be processed sequentially, so we can't process two blocks in two different threads
simultaneously.
Once we concluded that we can't parallelize the entire flow, we searched the program for
parts that take most of the CPU time and can be parallelized efficiently. In our search, we
used the Intel® VTune™ Performance Analyzer (see section 3.3).
After identifying code sections suitable for multithreading, our method was to create a task
entry function and a structure of relevant data for each thread. The structure is filled
dynamically with updated data, and holds other useful information that the thread needs while
running (see figure 5.1). Some of the fields of this structure contain the data that has to
be processed.
// setup thread #1 args.
args_0.dpps = wps->decorr_passes;
args_0.terms = wps->num_terms;
args_0.sample_count = sample_count;
args_0.buffer = buffer;
args_0.flags = flags;
// setup thread #2 args.
args_1.dpps = dpps_tmp;
memcpy(args_1.dpps, wps->decorr_passes, sizeof(struct decorr_pass)*MAX_NTERM);
args_1.terms = wps->num_terms;
args_1.sample_count = sample_count;
args_1.buffer = buffer_tmp;
memcpy(args_1.buffer, buffer, INPUT_SAMPLES*sizeof(int32_t*));
args_1.flags = flags;
Figure 5.1: code snippet from the function pack_samples – creating structures for the threads
The problem with such data is that every thread works on a different part of it, but still
has to access the same memory addresses. For example, in some sections of the code we work
on the samples buffer with two threads: one thread runs on the even samples, while the other
runs on the odd samples. If those two threads worked on the same buffer (same memory
address, offsets always from the start), then even though they would be working on different
data, we would still get memory sharing conflicts (false sharing) and lower performance. In
order to improve performance and create independent jobs for the threads, we create a copy
of the whole data set. Now each thread works on its own data, and there is no data sharing
conflict between them.
Because we created copies of some parts of the data, we have to wait until all the threads
have finished their jobs, merge the results of each thread back into the original buffer, and
free all allocated memory.
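The copy-and-merge idea for an even/odd split looks roughly like this (an illustrative sketch
with hypothetical names, not the project's exact code):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Give the second job its own copy of the buffer so the two threads
 * never touch the same cache lines, then fold the odd-sample results
 * back into the original buffer once both jobs are done. */
void process_split(int32_t *buffer, size_t count)
{
    size_t   i;
    int32_t *copy = malloc(count * sizeof *copy);

    memcpy(copy, buffer, count * sizeof *copy);

    /* thread 1 runs the even samples in 'buffer', thread 2 runs the odd
     * samples in 'copy' (submitted via the thread pool as in figure 5.2,
     * followed by waitForHandle; omitted here) */

    for (i = 1; i < count; i += 2)     /* merge the odd results back */
        buffer[i] = copy[i];

    free(copy);
}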
In order to reduce thread creation overhead, we used a thread pool (see Appendix B).
When we want to parallelize a section of the code, we use the method described above,
and then submit the job to the thread pool (see figure 5.2). An available thread in the pool
"takes" the job and starts working. In this way, there is no need to create a thread for each
job we want to run, and we reuse the same threads for multiple jobs.
// make them run through thread pool
// thread1 will run the even samples
submitWork(tp, run_dpp_0, &args_0, decorrLock);
// thread2 will run the odd samples
submitWork(tp, run_dpp_1, &args_1, decorrLock);
// the calling thread will wait for both
waitForHandle(decorrLock, 2);
Figure 5.2: code snippet from the function pack_samples – submitting jobs to the thread pool
and waiting for them to finish.
In figure 5.3 you can see the new program flow after the second optimization:
Figure 5.3: detailed program flow after parallelizing specific code sections.
5.2 Performance Testing
We ran VTune analysis with this optimization only. The results were:
Figure 5.4: VTune analysis results.
In figure 5.4, we can see all the threads and their running analysis. The first thread here
is the main thread. The other two are threads from the thread pool, and they are triggered by
the main thread to do the multithreaded jobs. We can see (figures 5.6, 5.7) that the pool
threads were working on the function ‘decorr_stereo_pass_id2’ (we split this function
into two separate indexed functions).
Figure 5.5: running results of the main thread (thread ID 4680).
Figure 5.6: running results of the first “pool thread” (thread ID 4276).
Figure 5.7: running results of the second “pool thread” (thread ID 4272).
5.3 Conclusions
Compared to the original results, we can see (figures 5.6, 5.7) that each thread runs for
about half of the original single-threaded ‘decorr_stereo_pass_id2’ running time. Since
the original measured time was about 9.77 seconds (33% of the total time of 29.625 seconds),
we can expect an improvement of 9.77 / 2 = 4.885 seconds. The new measured time is 25.375
seconds, which is 4.25 seconds less than the original time. If we account for the
multithreading overhead, the results match expectations.
In summary, the total speedup we achieved here is 29.625 / 25.375 = 1.167.
This gives us about a 16% improvement in performance.
6. Third Optimization - Moving to SIMD
6.1 Description
One of the advantages of the Intel® Core™ i7 processor is a set of instructions (SIMD – see
Appendix C) that operate on 128 bits of data. In order to use these instructions efficiently,
we searched the program for mathematical code sections with many calculations and
repetitions, such as loops.
The idea was to convert four 32-bit operations into one 128-bit operation. Theoretically,
with this method, each section can become up to x4 faster.
In the WavPack program, there are sections of code that include mathematical calculations
inside "for" loops. These loops are repeated tens of thousands of times, so this was the
right place to try using SIMD instead of the original implementation.
Since the original data size in each step was 32 bits, and a 128-bit instruction operates on
four 32-bit elements, we had to find a way to create four independent calculations. To do
that, we used loop unrolling, so that each iteration of the loop now computes four steps.
Inside the loop, we load the relevant data into 128-bit registers and implement the same
calculations as the original program. After the data is processed, we store it from the
registers back to the buffer. In between, we use several intrinsics that operate on the
128-bit registers.
In the following example (figure 6.1), we used SSE2 instructions that operate on 128 bits.
__m128i sam1, sam2, tmp;
…
// this code uses SSE2 intrinsics; in addition, we added loop unrolling
for (bptr = buffer; bptr < tmp_eptr; bptr += 16)
{
// set all initial integers for the calculations, including two loops
sam1 = _mm_set_epi32(bptr[4], bptr[2], bptr[0], dpp->samples_A[0]);
tmp = _mm_set_epi32(bptr[2], bptr[0], dpp->samples_A[0], dpp->samples_A[1]);
sam2 = _mm_set_epi32(bptr[12], bptr[10], bptr[8], bptr[6]);
sam1 = _mm_slli_epi32(sam1, 1); //multiply by 2
sam1 = _mm_sub_epi32(sam1, tmp); // sub tmp from 2*sam1
// set integers for second loop
tmp = _mm_set_epi32(bptr[10], bptr[8], bptr[6], bptr[4]);
sam2 = _mm_slli_epi32(sam2, 1); //multiply by 2
sam2 = _mm_sub_epi32(sam2, tmp); // sub tmp from 2*sam2
…
}
Figure 6.1: code snippet from the function decorr_stereo_pass_id2_0
Each variable contains four independent 32-bit elements, and the calculations are done on
all four elements simultaneously. In addition, we unrolled the loop one more time, so each
iteration processes eight steps. Since this specific example processes the even indexes of
the buffer, we increment the buffer pointer by 16.
Because of the loop unrolling, and the fact that we don't know the buffer's size in advance,
we have to handle the part of the buffer that wasn't processed (the remainder modulo 16).
For that, we added another "for" loop after the one described, containing the original code,
which processes the last part of the buffer.
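Schematically, the unrolled SIMD loop and its scalar tail fit together like this (a
simplified sketch; process_eight_simd stands for the SSE2 body of figure 6.1 and process_one
for the original per-sample code):

#include <stddef.h>
#include <stdint.h>

extern void process_eight_simd(int32_t *p); /* unrolled SSE2 body */
extern void process_one(int32_t *p);        /* original scalar body */

/* The SIMD loop consumes the buffer in chunks of 16 entries (eight
 * even-indexed samples); a scalar loop with the original code handles
 * whatever remains. */
void process_buffer(int32_t *buffer, size_t count)
{
    size_t i, main_part = count - (count % 16);

    for (i = 0; i < main_part; i += 16)     /* vectorized main loop */
        process_eight_simd(buffer + i);

    for (; i < count; i += 2)               /* scalar tail, even indexes */
        process_one(buffer + i);
}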
6.2 Performance Testing
Note: we implemented this optimization on top of the multithreaded code. In order to test
this optimization independently of the others, we ran the work of the two threads serially.
The analysis result:
Figure 6.2: VTune analysis with SIMD optimizations results.
We can see that the improvement is achieved in the function ‘decorr_stereo_pass_id2’. In the
original program it took 33% of the total time, and in this run it took about 30%. The number
of retired instructions is about 33,088,500,000. The total runtime is 28.39 seconds.
6.3 Conclusions
This optimization saved us 3% of the total runtime, bringing it to 28.39 seconds instead
of the original 29.625. We can also see a reduction of 4,698,000,000 retired instructions.
This is probably a result of the SIMD instructions and the loop unrolling, and is the main
reason for the improvement.
The speedup we got here is 29.625 / 28.39 = 1.043.
This optimization alone does not improve performance significantly.
7. Fourth Optimization - Implementation Improvements
7.1 Description
After applying the optimizations discussed above, we still thought we could obtain a further
improvement in speed. We decided to go over hot spots in the code and reimplement parts of
the code in a more efficient way.
The changes we made were guided by VTune, which located several places in the code with
heavy mathematical calculations or high branch prediction misses.
In this document, we chose to include the functions that were critical in terms of improvement.
According to VTune, we were still getting hot spots in the ‘flush_word’ function, and
specifically in the macros it uses (a code snippet is shown in figure 7.1).
#define putbit_1(bs) { (bs)->sr |= (1 << (bs)->bc); \
if (++((bs)->bc) == sizeof (*((bs)->ptr)) * 8) { \
*((bs)->ptr) = (bs)->sr; \
(bs)->sr = (bs)->bc = 0; \
if (++((bs)->ptr) == (bs)->end) (bs)->wrap (bs); \
}}
#define putbits(value, nbits, bs) { \
(bs)->sr |= (int32_t)(value) << (bs)->bc; \
if (((bs)->bc += (nbits)) >= sizeof (*((bs)->ptr)) * 8) \
do { \
*((bs)->ptr) = (bs)->sr; \
(bs)->sr >>= sizeof (*((bs)->ptr)) * 8; \
if (((bs)->bc -= sizeof (*((bs)->ptr)) * 8) > 32 - sizeof (*((bs)->ptr)) * 8) \
(bs)->sr |= ((value) >> ((nbits) - (bs)->bc)); \
if (++((bs)->ptr) == (bs)->end) (bs)->wrap (bs); \
} while ((bs)->bc >= sizeof (*((bs)->ptr)) * 8); \
}
Figure 7.1: putbit_1, putbits macros code snippets.
This function puts bits into the output buffer. It stores '1's and '0's in a temporary
int32 variable, and after 16 bits have been set, it sends them to the output buffer. The
macros it uses are putbit, putbits, putbit_0, and putbit_1. Since these macros contain a
large number of branches that depend on the input file, reducing these branches should
clearly give a performance boost.
We ultimately implemented two changes: the first was reimplementing the macros to use no
branches, and the second was using an int64 variable so the bits are written out in larger
chunks.
#define putbit_1_opt(bs) { \
(bs)->num |= (uint64_t)1 << (bs)->bc; \
++((bs)->bc); \
}
#define putbits_opt(value, nbits, bs) { \
(bs)->num |= (uint64_t)(value) << (bs)->bc; \
(bs)->bc += nbits; \
}
Figure 7.2: code snippet – putbit_1, putbits macros after optimization.
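The snippet above only accumulates bits; the matching flush step drains whole 16-bit words
from the 64-bit accumulator into the output stream. A hedged sketch of that step (the actual
flush_word_opt may differ in detail):

#include <stdint.h>

typedef struct {
    uint64_t  num;    /* 64-bit bit accumulator */
    int       bc;     /* number of valid bits in 'num' */
    uint16_t *ptr;    /* next output word */
} Bitstream64;        /* illustrative type, not the project's exact one */

static void flush_words_sketch(Bitstream64 *bs)
{
    while (bs->bc >= 16) {               /* a full output word is ready */
        *bs->ptr++ = (uint16_t)bs->num;  /* emit the low 16 bits */
        bs->num >>= 16;
        bs->bc  -= 16;
    }
}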
7.2 Performance Testing
One of the main reasons for implementing this optimization was branch misprediction. To show
the difference in misprediction between the original program and the optimized one, we first
ran a branch misprediction analysis on the original program:
Figure 7.3: Original program - VTune analysis for branch misprediction.
In the function ‘flush_word’, we can see about 78,420,338 mispredicted branches out of a
total of 1,713,906,754 branch instructions.
Now let’s run the optimized program, and check its results:
Figure 7.4: Optimized program - VTune analysis for branch misprediction.
We can see a drastic reduction in branch mispredictions in the optimized function –
‘flush_word_opt’ – to about a fifth of the original (15,435,072). Also, the total number of
branch instructions is lower than in the original program – about 400 million events fewer.
Another result of this optimization is that the function ‘send_words_lossless’ has more
branch events – about 800 million more – but more than half of the branches that were
mispredicted in the original program are now predicted correctly.
A further source of improvement is the use of 64-bit data elements instead of 32-bit ones.
All these improvements bring the total runtime down to 27.937 seconds.
7.3 Conclusions
As a result of the reduction in branches and branch mispredictions, and of using 64-bit
integers instead of 32-bit integers, we see an improvement in performance – almost 2 seconds
less than the original program.
It might seem that we could go further and use 128 bits with SIMD instructions, but then the
use of the 128-bit registers would cause too much overhead and we wouldn't achieve any
speedup. For that reason, reimplementing with a 64-bit integer, which maps to a 64-bit
register, was the best choice here.
The total speedup we got here is 29.625 / 27.937 = 1.06.
8. Optimization Summary
Here we can see all the threads that run in the fully optimized application:
Figure 8.1: VTune analysis results of all optimizations together
1. Main thread (Thread ID 4024)
2. "Pool thread" (Thread ID 3900) – working on the left channel, as described in section 5.
3. "Pool thread" (Thread ID 7124) – working on the right channel, as described in section 5.
4. Reading thread (Thread ID 6596) – reading the blocks, as described in section 4.
5. Writing thread (Thread ID 2984) – writing the blocks, as described in section 4.
Here, we can see that each thread runs on a different core. This ensures that the
multithreading is used as efficiently as possible.
Figure 8.2: VTune analysis results of all optimizations together, threads with CPU info
In figure 8.3, we can see each optimization's speedup, as well as the total speedup. The most
significant optimization was the multithreading of code sections, with a 16% speedup, while
the least significant was the parallel I/O, with a 2.6% speedup.
Figure 8.3: speedup of each optimization step and of all optimizations together (bar chart;
vertical axis: speedup, 0–1.6).
The total speedup we achieved is 29.625 / 22.187 = 1.335, meaning the program runs faster by
33.5%.
9. Appendix A – Blocking Queue
The blocking queue is an ADT that blocks incoming calls: dequeue blocks until there is an
item in the queue to return, and enqueue blocks when there is no more room.
It is initialized with a size parameter and contains an array of items of that size, as well
as some semaphores that implement the blocking.
When dequeue is invoked, the queue blocks the calling thread until there is an item in the
queue, and then returns it.
When enqueue is invoked, the queue adds the item to the queue, unless the queue is full, in
which case the calling thread is blocked until the queue has room for the incoming job.
The queue also provides an additional feature: "terminating" the queue. This lets the queue
know that there are no more incoming items. From then on, if the queue is empty when a
dequeue arrives, it does not block the caller but returns "terminated".
Calling delete_queue releases the queue's resources.
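A minimal sketch of such a queue using Win32 primitives is given below (illustrative; the
project's implementation may differ in details such as how termination is signalled):

#include <windows.h>
#include <limits.h>
#include <stdlib.h>

typedef struct {
    void **buf;             /* circular array of items */
    int    size, head, tail;
    int    terminated;
    HANDLE slots;           /* counts free slots, starts at 'size' */
    HANDLE items;           /* counts filled slots, starts at 0 */
    CRITICAL_SECTION lock;  /* protects buf/head/tail/terminated */
} BlockingQueue;

BlockingQueue *bq_create(int size)
{
    BlockingQueue *q = calloc(1, sizeof *q);
    q->buf   = calloc(size, sizeof *q->buf);
    q->size  = size;
    q->slots = CreateSemaphore(NULL, size, size, NULL);
    q->items = CreateSemaphore(NULL, 0, LONG_MAX, NULL);
    InitializeCriticalSection(&q->lock);
    return q;
}

void bq_enqueue(BlockingQueue *q, void *item)
{
    WaitForSingleObject(q->slots, INFINITE);  /* block while full */
    EnterCriticalSection(&q->lock);
    q->buf[q->tail] = item;
    q->tail = (q->tail + 1) % q->size;
    LeaveCriticalSection(&q->lock);
    ReleaseSemaphore(q->items, 1, NULL);      /* wake one consumer */
}

void *bq_dequeue(BlockingQueue *q)            /* NULL means "terminated" */
{
    void *item;
    WaitForSingleObject(q->items, INFINITE);  /* block while empty */
    EnterCriticalSection(&q->lock);
    if (q->terminated && q->head == q->tail) {
        LeaveCriticalSection(&q->lock);
        ReleaseSemaphore(q->items, 1, NULL);  /* cascade to other waiters */
        return NULL;
    }
    item = q->buf[q->head];
    q->head = (q->head + 1) % q->size;
    LeaveCriticalSection(&q->lock);
    ReleaseSemaphore(q->slots, 1, NULL);      /* one more free slot */
    return item;
}

void bq_terminate(BlockingQueue *q)
{
    EnterCriticalSection(&q->lock);
    q->terminated = 1;
    LeaveCriticalSection(&q->lock);
    ReleaseSemaphore(q->items, 1, NULL);      /* unblock a waiting consumer */
}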
10. Appendix B – Thread Pool
The thread pool is a singleton ADT.
It is created with a constant number of threads, and consists of a blocking queue for
incoming jobs.
All the pool's threads start in a loop, waiting on the queue's dequeue method for an incoming
job.
To submit a job to the pool, the current thread creates a "thread job" which stores a
function pointer describing what to do, the arguments, and a mutex object to synchronize on.
Once the job has been submitted, the next thread waiting on the queue is released from the
queue's lock and receives the job.
The thread runs that job; when done, it releases the mutex object, frees the job's resources,
and returns to wait on the queue for another job.
If all the threads are busy, the jobs wait in the queue.
When the application ends, it calls the pool's delete method to close the pool's threads and
release any remaining resources.
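A condensed sketch of the pool on top of the Appendix A queue (illustrative; bq_* are the
queue names sketched there, and the 'done' handle plays the role of the synchronization
object that submitWork/waitForHandle in figure 5.2 use):

#include <windows.h>
#include <stdlib.h>

typedef struct {
    void  (*func)(void *);  /* what to do */
    void   *args;           /* with what */
    HANDLE  done;           /* semaphore released when the job completes */
} ThreadJob;

typedef struct {
    BlockingQueue *jobs;    /* from Appendix A */
    HANDLE        *threads;
    int            nthreads;
} ThreadPool;

static DWORD WINAPI worker_loop(LPVOID arg)
{
    ThreadPool *tp = arg;
    ThreadJob  *job;
    /* each worker blocks on the queue, runs a job, signals completion */
    while ((job = bq_dequeue(tp->jobs)) != NULL) {
        job->func(job->args);
        ReleaseSemaphore(job->done, 1, NULL); /* the caller can wait on this */
        free(job);
    }
    return 0;
}

ThreadPool *tp_create(int nthreads, int queue_size)
{
    int i;
    ThreadPool *tp = malloc(sizeof *tp);
    tp->jobs     = bq_create(queue_size);
    tp->nthreads = nthreads;
    tp->threads  = malloc(nthreads * sizeof *tp->threads);
    for (i = 0; i < nthreads; i++)
        tp->threads[i] = CreateThread(NULL, 0, worker_loop, tp, 0, NULL);
    return tp;
}

void tp_submit(ThreadPool *tp, void (*func)(void *), void *args, HANDLE done)
{
    ThreadJob *job = malloc(sizeof *job);
    job->func = func;
    job->args = args;
    job->done = done;
    bq_enqueue(tp->jobs, job);  /* the first idle worker picks it up */
}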
11. Appendix C – SIMD
SIMD stands for Single Instruction, Multiple Data. It means we can operate on N elements
with one instruction. In our project we used an Intel Core i7 processor, which supports
128-bit registers. With these registers, we can perform four operations on 32-bit elements,
or two operations on 64-bit elements, simultaneously. This can be significant when trying to
achieve a performance speedup.
Since most of our SIMD improvements were in mathematical calculations, we mostly used SSE
and SSE2 instructions.
These instructions are used through 'intrinsics' – wrappers for SIMD instructions
implemented in Visual C++.
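As a minimal intrinsics example (SSE2; compiles with Visual C++ or GCC), here are four
32-bit additions done with one instruction:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t a[4] = { 1, 2, 3, 4 };
    int32_t b[4] = { 10, 20, 30, 40 };
    int32_t r[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 4 ints */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);                /* 4 adds at once */
    _mm_storeu_si128((__m128i *)r, vr);

    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);   /* 11 22 33 44 */
    return 0;
}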
12. References
http://www.wavpack.com
http://sourceforge.net/
http://softlab.technion.ac.il/
http://msdn.microsoft.com
http://en.wikipedia.org/wiki/
http://www.google.co.il/
http://www.intel.com/