School of Computer and Information Science
CIS Research Placement Report
Multiple threads in
floating-point sort operations
Name: Quang Do
Date: 8/6/2012
Supervisor: Grant Wigley
Abstract
Despite vast improvements in CPU design and the inclusion of multiple cores in even
low-end processors, little has been done to make sorting algorithms take advantage of
them. This paper aims to demonstrate the benefits of using multiple cores for these
operations. It also shows how these sorting operations can be implemented.
1. Introduction
Multi-threaded applications are starting to become a necessity as hardware
development shifts from clock speed to multiple cores. Even though multi-core
processors have been around for quite a while, many applications and operating
systems do not yet use them efficiently.
The purpose of this paper is to show the performance increases that are possible
when different sorting algorithms are multi-threaded. This is done by running
several sorting algorithms and comparing the single threaded and multi-threaded
versions of the same algorithm. An overview of the steps taken to obtain the
data is given along with the resulting data.
Sorting algorithms make up the basis of many applications used in the real world.
Since their uses are so abundant, it is extremely useful to be able to speed up
their processing time. This results in more efficient applications that respond
and output data at an increased rate and throughput.
Although there are many advantages to multi-threading an algorithm, which is
the reason why so many technologies and hardware platforms are moving to
multiple core processors, there are also several disadvantages.
1.1 Floating Point Numbers
Floating-point numbers, in this paper, are 32-bit single-precision numbers. All
sorting operations performed are on these 4-byte values.
1.2 Sorting Algorithms
Several sorting algorithms are used within this document. Most of them are
chosen to show differences in sorting efficiency rather than for their real
world uses.
Insertion sort is used to show the performance changes for an algorithm whose
average processing time is not O(n log n) but O(n^2). Quick sort is an
algorithm which is used in Java for its sorting operations. As such it is often
used in real world applications and games. It has an average case performance of
O(n log n). Shell sort is yet another sorting algorithm with a different average
processing time. It has an average case performance of O(n^(3/2)).
All of these algorithms, though, are naturally sequential and single threaded.
They rely on the data within the list to be sorted to be in one place and be
unchanged as the processing is done to sort the list.
In order to allow for comparisons of different single threaded sorting algorithms
when they are multi-threaded, the use of a parallel sorting algorithm is required.
In this paper, bucket sort is used to allow a useful comparison.
Bucket sort works by first splitting the array of objects to be sorted, in this case
floating-point numbers, by specified criteria. The aim is for each of the buckets
to contain a similar number of objects. For a random set of data, this may be
difficult to achieve.
For a completely random set of data, the data can be split by value into fixed
ranges. Once the data has been split, a different sorting algorithm is used on
each of the buckets until it is fully sorted. Since every value in one bucket is
smaller than every value in the next, the sorted buckets can simply be joined
together in order to return the completed sorted array.
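The splitting and joining steps described above can be sketched in Java as follows. This is a minimal single-threaded sketch, not the code used for the trials: the class name `BucketSortSketch`, the `min`/`max` range parameters and the equal-width bucket rule are assumptions made for illustration, and `Arrays.sort` stands in for whichever algorithm is run on each bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BucketSortSketch {
    // Maps a value to a bucket by equal-width ranges over [min, max] (assumed rule).
    static int bucketIndex(float value, float min, float max, int bucketCount) {
        int index = (int) ((value - min) / (max - min) * bucketCount);
        // Clamp so that value == max (or anything out of range) lands in an end bucket.
        return Math.max(0, Math.min(bucketCount - 1, index));
    }

    static float[] sort(float[] data, float min, float max, int bucketCount) {
        List<List<Float>> buckets = new ArrayList<>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
        // Step 1: split the data into buckets by range.
        for (float value : data) {
            buckets.get(bucketIndex(value, min, max, bucketCount)).add(value);
        }
        // Steps 2 and 3: sort each bucket, then join the buckets in order.
        float[] result = new float[data.length];
        int position = 0;
        for (List<Float> bucket : buckets) {
            float[] sorted = new float[bucket.size()];
            for (int i = 0; i < sorted.length; i++) {
                sorted[i] = bucket.get(i);
            }
            Arrays.sort(sorted); // stand-in for insertion, quick or shell sort
            System.arraycopy(sorted, 0, result, position, sorted.length);
            position += sorted.length;
        }
        return result;
    }
}
```

For percentages, a call such as `BucketSortSketch.sort(values, 0f, 100f, 4)` would use four buckets of width 25.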
Naturally, the first division of data into buckets incurs a loss in performance, so
the individual sorting algorithms do not scale perfectly with the number of buckets.
This is the downside of using a parallel sorting algorithm such as bucket sort. But
since each of the algorithms incurs the same loss in performance, it has a
negligible effect on the comparison between them.
Figure 1.1 – Bucket Sort of Percentages
Figure 1.1 shows an example of bucket sort on percentages. The data has been
arbitrarily chosen to be split equally between 0 and 100. The number of buckets
has been decided as 4. This may be because the system doing the processing is a
quad core, quad thread processor. Each of the buckets can then be sorted on its
own thread.
The circles represent where the individual buckets' data is sorted. This can
consist of any single threaded or even multi-threaded sorting algorithm. This
area is where each of the sorting algorithms in Figure 1.2 is used.
The placing of data into buckets in a bucket sort takes linear, O(n), time.
Sorting Algorithm   Best Case      Worst Case     Average        Memory
                    Performance    Performance    Performance    Usage
Quick Sort          n log n        n^2            n log n        log n
Shell Sort          n              n (log n)^2    n^(3/2)        1
Insertion Sort      n              n^2            n^2            1
Figure 1.2 – Sorting Algorithms
1.3 Insertion Sort
Insertion sort is a very simple sorting algorithm that uses comparisons to sort an
array. Its speed depends heavily on the size of the array; larger arrays take a
disproportionately larger amount of time to sort. It is efficient on arrays that
are already partially sorted, and it is very easy to implement.
Insertion sort works by methodically going along an array, starting at position 0,
comparing each element with the elements before that position, and inserting it
at the right position among them. For example:
[3 4 1 2]
Starting array
[3 4 1 2]
Start at first element
[3 4 1 2]
Move to next element. It is larger so no sorting is required.
[3 4 1 2]
Move to next element. Element is smaller so it is inserted at the correct position.
[1 3 4 2]
Move to next element. It is also smaller so it is inserted at the correct position.
[1 2 3 4]
Completed sorted array.
The best case for insertion sort is when the array is already sorted. It then has a
running time of O(n). The worst case for insertion sort is when it is used on an
array already sorted in reverse order. This has a running time of O(n^2).
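The procedure above can be written as a short Java method. This is a minimal sketch of a textbook insertion sort on a `float` array; the class name `InsertionSortSketch` is chosen here for illustration and the code is not taken from the trials.

```java
class InsertionSortSketch {
    // Sorts the array in place in ascending order.
    static void sort(float[] a) {
        for (int i = 1; i < a.length; i++) {
            float current = a[i];
            int j = i - 1;
            // Shift larger elements one place right to open a slot for current.
            while (j >= 0 && a[j] > current) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = current;
        }
    }
}
```

On an already sorted array the inner loop never runs, which gives the O(n) best case; on a reversed array every element is shifted past all earlier ones, which gives the quadratic worst case.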
1.4 Quick Sort
Quick sort is said to be the fastest sorting algorithm for sorting random sets of
data. This makes it ideal to be used for testing of floating point sorting.
Quick sort works as follows:
- A pivot is chosen; our algorithm uses the middle element. The pivot can have
any value, even one that is not within the array.
- Any values in the array that are greater than the pivot value are put on the right
of the pivot point. Values that are smaller are put on the left side, and values
that are equal may be left on either side.
- The algorithm then recursively sorts both sides of the pivot.
[3 4 1 2]
Starting array
[3 4 1 2]
Choose pivot point
[2 1 4 3]
Arrange elements
[[2 1] [4 3]]
Recursively quick sort
[[1 2] [3 4]]
[1 2 3 4]
Sorted array
The best case for quick sort is when each partition is the same size after each
step. This is still O(n log n), which is the same as its average case.
The worst case is O(n^2), which occurs, for example, when the last element in
the array is selected as the pivot and the array is already sorted.
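A quick sort that uses the middle element as its pivot might look like the following in Java. This is a sketch under assumptions: the partitioning scheme (two indices moving towards each other) and the class name `QuickSortSketch` are choices made here, since the report does not say how the elements were arranged around the pivot. It also assumes the array contains no NaN values.

```java
class QuickSortSketch {
    static void sort(float[] a) {
        sort(a, 0, a.length - 1);
    }

    // Sorts a[low..high] in place.
    static void sort(float[] a, int low, int high) {
        if (low >= high) {
            return;
        }
        float pivot = a[low + (high - low) / 2]; // middle element as pivot
        int i = low;
        int j = high;
        while (i <= j) {
            while (a[i] < pivot) i++; // skip values already on the left side
            while (a[j] > pivot) j--; // skip values already on the right side
            if (i <= j) {
                float tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
                i++;
                j--;
            }
        }
        // Recursively sort both sides of the partition.
        sort(a, low, j);
        sort(a, i, high);
    }
}
```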
1.5 Shell Sort
Shell sort is a sorting algorithm that, like the other algorithms discussed in this
paper, uses comparison-based sorting. It is an improvement upon insertion sort
that allows elements to be exchanged even if they are not adjacent.
A gap size is chosen. For this paper, a gap size of half of the length of the array is
used. The elements that lie a gap apart from each other form a sub array, and
each sub array is sorted. The gap is then reduced; in our case, it is halved.
This is repeated until the gap size is 1, at which point the algorithm is
effectively an insertion sort on an almost completely sorted array.
[3 1 6 5 2 4]
Gap size is 3.
[3 1 6 5 2 4] Sort the array of [3 5]
[3 1 6 5 2 4] Sort the array of [1 2]
[3 1 6 5 2 4] Sort the array of [6 4] = [4 6]
= [3 1 4 5 2 6]
Gap size is 2
[3 1 4 5 2 6] Sort the array of [3 4 2] = [2 3 4]
[3 1 4 5 2 6] Sort the array of [1 5 6]
= [2 1 3 5 4 6]
Gap size is 1
[2 1 3 5 4 6]
= [1 2 3 4 5 6]
Finished array
The best case for shell sort is when the elements are already sorted. This is O(n).
The worst case for this version of shell sort is the same as for insertion sort:
a data set sorted in reverse order. It is also O(n^2).
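The gap-halving scheme can be sketched in Java as below. This is an illustrative version, not the code used for the trials; the class name `ShellSortSketch` is an assumption. Note that integer halving of the gap takes a six element array from a gap of 3 straight to 1, so it skips the intermediate gap of 2 shown in the worked example.

```java
class ShellSortSketch {
    // Sorts the array in place, starting with a gap of half the length
    // and halving it (integer division) until it reaches 1.
    static void sort(float[] a) {
        for (int gap = a.length / 2; gap > 0; gap /= 2) {
            // Gapped insertion sort: each element is inserted among the
            // earlier elements that lie a multiple of gap away from it.
            for (int i = gap; i < a.length; i++) {
                float current = a[i];
                int j = i;
                while (j >= gap && a[j - gap] > current) {
                    a[j] = a[j - gap];
                    j -= gap;
                }
                a[j] = current;
            }
        }
    }
}
```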
2. Results
To be able to get a fair comparison of sorting algorithms when multi-threaded,
comparisons need to be made to their single threaded processing times. The
sorting algorithms being tested are: Insertion Sort, Quick Sort and Shell Sort.
The system used to perform all these sorting operations is a 64-bit Windows 7
machine with an Intel Xeon X5677 @ 3.47GHz. This CPU is a hyper-threaded
quad core, meaning it has a total of 8 hardware threads. The machine uses a WD
Black 1TB hard drive as the primary drive and has 6 GB of DDR3 RAM.
The sorting algorithms are programmed in Java and run using the Java Virtual
Machine. This incurs only a minor performance loss in these trials. Threads are
used in Java to perform the sorting of the float arrays. Each thread is considered
a bucket in the bucket sort algorithm.
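One way to treat each thread as a bucket is sketched below. This is a hedged illustration rather than the code used for the trials: it assumes the values lie in the range 0 (inclusive) to 1 (exclusive), uses equal-width buckets, and uses `Arrays.sort` as a stand-in for the algorithm under test. The names `ThreadedBucketSortSketch` and `bucketOf` are hypothetical.

```java
import java.util.Arrays;

class ThreadedBucketSortSketch {
    // Assumed rule: values in [0, 1) are split into equal-width ranges.
    static int bucketOf(float value, int bucketCount) {
        int index = (int) (value * bucketCount);
        return Math.max(0, Math.min(bucketCount - 1, index));
    }

    static float[] sort(float[] data, int threadCount) {
        // Count the bucket sizes, then copy each value into its bucket.
        int[] sizes = new int[threadCount];
        for (float v : data) sizes[bucketOf(v, threadCount)]++;
        float[][] buckets = new float[threadCount][];
        for (int b = 0; b < threadCount; b++) buckets[b] = new float[sizes[b]];
        int[] filled = new int[threadCount];
        for (float v : data) {
            int b = bucketOf(v, threadCount);
            buckets[b][filled[b]++] = v;
        }
        // One thread per bucket; each thread sorts only its own array.
        Thread[] workers = new Thread[threadCount];
        for (int b = 0; b < threadCount; b++) {
            final float[] bucket = buckets[b];
            workers[b] = new Thread(() -> Arrays.sort(bucket)); // stand-in sort
            workers[b].start();
        }
        try {
            for (Thread worker : workers) worker.join(); // wait for every bucket
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sorting", e);
        }
        // Join the sorted buckets in order.
        float[] result = new float[data.length];
        int position = 0;
        for (float[] bucket : buckets) {
            System.arraycopy(bucket, 0, result, position, bucket.length);
            position += bucket.length;
        }
        return result;
    }
}
```

Because no two threads touch the same array, no locking is needed; `join` is enough to make the sorted buckets visible to the calling thread.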
On each of these threads, one of the sorting algorithms listed above is performed.
The resulting time taken is then compared with a sequential single threaded
form of the algorithm. All times are first measured in nanoseconds for the
greatest accuracy and then converted to seconds.
The results are organized from 8 threads to one thread in order to easily show
the performance changes. The floating point numbers are generated using a
pseudo-random number generator.
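The measurement procedure can be sketched as follows. This is an assumption-laden illustration, not the harness used for the trials: `java.util.Random` with a fixed seed is one possible pseudo-random generator, `Arrays.sort` stands in for the algorithm being timed, and the names `SortTimingSketch`, `randomFloats` and `timeSeconds` are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

class SortTimingSketch {
    // Generates count pseudo-random floats in the range [0, 1).
    static float[] randomFloats(int count, long seed) {
        Random random = new Random(seed);
        float[] data = new float[count];
        for (int i = 0; i < count; i++) {
            data[i] = random.nextFloat();
        }
        return data;
    }

    // Runs the task once and returns the elapsed time in seconds,
    // measured in nanoseconds and then converted.
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        float[] data = randomFloats(100_000, 42L);
        double seconds = timeSeconds(() -> Arrays.sort(data));
        System.out.printf("Sorted %d floats in %.5f s%n", data.length, seconds);
    }
}
```

A single run on the Java Virtual Machine includes warm-up effects such as just-in-time compilation, so repeating the measurement and averaging is one way to get steadier numbers.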
Using 8 Threads
[Chart: Sorting Algorithms (8 Threads). No. of Objects (100, 1000, 10000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
[Chart: Sorting Algorithms (8 Threads). No. of Objects (100000, 1000000, 10000000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
With the fewest objects (100), insertion sort performs the fastest with 8 threads
on the quad core machine. At 1000 objects, insertion sort already starts to fall
behind the other sorting algorithms. It recovers at 10 000 objects, where its
speed is similar to that of shell sort while quick sort is significantly slower.
Shell sort begins to show its speed at 10 000 objects and over.
Insertion sort rapidly becomes slower than the other algorithms; at
10 million objects it is unable to sort the float array in under a minute. Quick
sort is the fastest algorithm at 10 million objects.
Using 4 Threads
[Chart: Sorting Algorithms (4 Threads). No. of Objects (100, 1000, 10000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
[Chart: Sorting Algorithms (4 Threads). No. of Objects (100000, 1000000, 10000000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
Unlike in the results above, where insertion sort is able to beat some of the other
sorting algorithms at the lower numbers of elements, with 4 threads it comes last
in every trial. This is due to the use of 4 threads instead of 8, which severely
limits its speed as it is an O(n^2) algorithm.
With a million objects, insertion sort takes 28 seconds compared to a bit over 5
seconds with 8 threads, which is roughly 5.2 times as long. Quick sort,
once again, is the leader in speed when a large number of elements need
to be sorted.
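The effect of the thread count on an O(n^2) algorithm can be estimated with a short calculation. This is a rough model rather than a result from the trials: it assumes the n elements are spread evenly over k buckets, that each bucket is sorted on its own core, and that the cost of splitting and joining is ignored. The constant c and the symbol T_k (the sorting time with k threads) are introduced here for illustration.

```latex
T_1 \approx c\,n^2,
\qquad
T_k \approx c\left(\frac{n}{k}\right)^2 = \frac{T_1}{k^2},
\qquad
\frac{T_k}{T_{2k}} \approx 4
```

Under these assumptions, each doubling of the thread count cuts the insertion sort time to about a quarter. The insertion sort times for 100 000 objects in Appendix A give ratios of about 3.9 (1 to 2 threads), 3.7 (2 to 4 threads) and 3.3 (4 to 8 threads), which is close to this estimate.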
Using 2 Threads
[Chart: Sorting Algorithms (2 Threads). No. of Objects (100, 1000, 10000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
[Chart: Sorting Algorithms (2 Threads). No. of Objects (100000, 1000000, 10000000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
Using just 2 threads, insertion sort once again suffers an even greater performance
loss. It is unable to sort even a million objects in under a minute. The time it
takes insertion sort to sort 100 000 objects is comparable to the time it takes
shell sort to sort 10 million objects, an array 100 times larger.
Shell sort is the fastest using 2 threads at 1000 objects and under. Quick sort
then takes the lead at 10 000 or more elements.
Using 1 Thread
[Chart: Sorting Algorithms (1 Thread). No. of Objects (100, 1000, 10000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
[Chart: Sorting Algorithms (1 Thread). No. of Objects (100000, 1000000, 10000000) against Time (Seconds) for Insertion Sort, Quick Sort and Shell Sort.]
By sorting without using bucket sort, each of the algorithms performs much
faster at 1000 or fewer objects. At 10 000 objects, insertion sort already shows a
massive increase in processing time. At one thread, this effect is more
pronounced and noticeable at an earlier stage. The time insertion sort takes to
sort 100 000 objects is almost twice as long as the time shell sort takes to sort
a 10 million object array, which makes it around 200 times slower per object.
Quick sort is once again the clear victor in terms of speed. Shell sort solidly
maintains its second place position.
3. Conclusion
Based on the results (viewable in appendix A), it can be concluded that:
- With around 1000 objects or fewer to be sorted, the overhead from
multi-threaded sorting makes it inefficient to multi-thread these sorts; a
single-threaded sorting algorithm is much faster than a multi-threaded one at
this stage.
  o Shell sort and quick sort both have similar sorting times at this
  stage, so their use would be recommended. Although insertion sort
  is slower than these two, it is not significantly slower at this stage.
- At numbers greater than 1000, multi-threading starts to have an effect.
  o Insertion sort is by far the slowest of these three sorting
  algorithms, with times that are many times greater than those
  of the other two.
  o Even though insertion sort gains significant speed increases every
  time extra threads are added, this is still not enough for it to
  compete with the other more efficient sorting algorithms.
  o Single threaded sorting is still mainly faster, as with each thread
  increase, the processing overhead also increases.
- Once there are 100 000 elements or more in the floating point array, the
power of multi-threading really starts to show.
  o Insertion sort, from 1 thread to 2 threads, becomes about 4 times
  as fast. Quick sort and shell sort also improve, by roughly 35 to 60%.
  o At 4 threads, insertion sort is over 14 times as fast as the single
  threaded variant. At 8 threads, it is over 48 times as fast.
  o With 4 threads, quick sort runs 1.88 times as fast as its single
  threaded version and shell sort 2.03 times as fast.
  o At 8 threads, the performance of quick sort and shell sort
  actually decreases compared with 4 threads.
- With 1 000 000 elements, a single threaded insertion sort is unable to sort
the array in under a minute.
  o At 8 threads, both quick sort and shell sort lose performance
  compared to their 4 thread counterparts. Quick sort suffers
  most heavily, becoming even slower than its single threaded version.
- At extremely high numbers of 10 000 000:
  o Quick sort at 4 threads is the fastest overall, taking an average of a
  bit over half a second to sort the array.
  o The fastest form of shell sort is also with 4 threads, taking about a
  third longer than quick sort.
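The comparisons in this list come down to dividing the single threaded time by the multi-threaded time. The sketch below does this for the 100 000 object times listed in Appendix A; the class name `SpeedupSketch` and the `speedup` helper are introduced here for illustration only.

```java
class SpeedupSketch {
    // Ratio of the single threaded time to the multi-threaded time.
    // Values above 1 mean faster; values below 1 mean the extra threads slowed it down.
    static double speedup(double singleThreadSeconds, double multiThreadSeconds) {
        return singleThreadSeconds / multiThreadSeconds;
    }

    public static void main(String[] args) {
        // Times in seconds for 100 000 objects, copied from Appendix A,
        // in the order 1, 2, 4 and 8 threads.
        int[] threads = {1, 2, 4, 8};
        double[] insertion = {4.46106, 1.13258, 0.30377, 0.09247};
        double[] quick = {0.01181, 0.00866, 0.00629, 0.07493};
        double[] shell = {0.01673, 0.01049, 0.00824, 0.03030};
        for (int i = 1; i < threads.length; i++) {
            System.out.printf("%d threads: insertion %.2fx, quick %.2fx, shell %.2fx%n",
                    threads[i],
                    speedup(insertion[0], insertion[i]),
                    speedup(quick[0], quick[i]),
                    speedup(shell[0], shell[i]));
        }
    }
}
```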
It seems that 4 threads is the ideal number for multi-threading these sorting
algorithms on the quad core CPU tested. This gave the best performance based on
the results. The trend seems to be that more threads are better for more elements.
This means that for arrays of data larger than the 10 000 000 elements tested, it
may be more efficient to increase the number of threads. This also means that
insertion sort may even be a viable solution, given enough threads, although this
does mean a great deal of overhead due to the large number of threads.
The improvement may be even more pronounced if implemented on a GPU, as
GPUs can have hundreds of cores.
4. Bibliography
Sorting Algorithms Compared - Cprogramming.com. 2012. Available at:
http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html. [Accessed 21 May
2012].
What different sorting algorithms sound like - YouTube. 2012. Available at:
http://www.youtube.com/watch?feature=player_embedded&v=t8g-iYGHpEA. [Accessed 1 June
2012].
Bucket and radix sorting. 2012. Available at:
http://htmltolatex.sourceforge.net/samples/sample4.html. [Accessed 21 May 2012].
5. Appendix A
Floating point numbers compared with Integers
The sorting of floating point numbers may at first seem like a simple task, but the
truth is that there is much more processing required to sort a set of floating point
numbers when compared to integers. Due to each number having up to 32 bits of
information, comparisons take a significant amount of time and logical
processing. This also means that the algorithm chosen to perform this task needs
to be suitable and efficient.
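One concrete way in which a floating point comparison can need more logic than an integer comparison in Java is the handling of special values. The sketch below is an illustration added here, not part of the trials: it contrasts the `<` operator with the standard `Float.compare` method, which defines a total order that places NaN last and -0.0 before 0.0. The class name `FloatCompareSketch` and the `lessThan` helper are hypothetical.

```java
class FloatCompareSketch {
    // Total-order "less than" built on Float.compare.
    static boolean lessThan(float a, float b) {
        return Float.compare(a, b) < 0;
    }

    public static void main(String[] args) {
        float nan = Float.NaN;
        // With the plain operators, NaN is neither smaller nor larger than 1.0.
        System.out.println(nan < 1.0f);              // false
        System.out.println(nan > 1.0f);              // false
        // Float.compare orders NaN after every other value.
        System.out.println(lessThan(1.0f, nan));     // true
        // The operators treat the two zeros as equal; Float.compare does not.
        System.out.println(0.0f == -0.0f);           // true
        System.out.println(lessThan(-0.0f, 0.0f));   // true
    }
}
```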
Results in table form
These results are rounded to 5 decimal places for clarity. The original
results were recorded to 9 decimal places.
All times are in seconds. A dash marks an insertion sort run for which no time
is recorded.

8 Threads
No. of Objects    100        1000       10000      100000     1000000    10000000
Insertion Sort    0.00145    0.00232    0.00734    0.09247    5.40445    -
Quick Sort        0.00152    0.00168    0.01124    0.07493    0.16500    0.63655
Shell Sort        0.00180    0.00162    0.00669    0.03030    0.11285    0.78513

4 Threads
No. of Objects    100        1000       10000      100000     1000000    10000000
Insertion Sort    0.00006    0.00169    0.00592    0.30377    28.11392   -
Quick Sort        0.00004    0.00126    0.00266    0.00629    0.05577    0.57307
Shell Sort        0.00003    0.00113    0.00361    0.00824    0.08509    0.76169

2 Threads
No. of Objects    100        1000       10000      100000     1000000    10000000
Insertion Sort    0.00101    0.00197    0.01458    1.13258    -          -
Quick Sort        0.00090    0.00146    0.00242    0.00866    0.08068    0.82542
Shell Sort        0.00084    0.00121    0.00347    0.01049    0.11192    1.23144

1 Thread
No. of Objects    100        1000       10000      100000     1000000    10000000
Insertion Sort    0.00005    0.00097    0.04535    4.46106    -          -
Quick Sort        0.00003    0.00032    0.00131    0.01181    0.12513    1.42635
Shell Sort        0.00003    0.00042    0.00176    0.01673    0.19101    2.32085