School of Computer and Information Science
CIS Research Placement Report

Multiple threads in floating-point sort operations

Name: Quang Do
Date: 8/6/2012
Supervisor: Grant Wigley

Abstract

Despite the vast improvements in CPU design and the inclusion of multiple cores in even low-end processors, little has been done to improve the performance of sorting algorithms to take advantage of them. This paper aims to demonstrate the benefits of using multiple cores in these operations. It also shows how these sorting operations can be implemented.

1. Introduction

Multi-threaded applications are becoming a necessity as hardware development shifts from raising clock speed to adding cores. Even though multi-core processors have been around for quite a while, many applications and operating systems do not yet use them efficiently. The purpose of this paper is to show the performance increases that are possible by multi-threading different sorting algorithms. This is achieved by running several sorting algorithms and comparing the single-threaded and multi-threaded versions of each. An overview of the steps taken to obtain the data is given along with the resulting data.

Sorting algorithms form the basis of many real-world applications. Because their uses are so abundant, it is extremely useful to be able to reduce their processing time. This results in more efficient applications that respond faster and deliver higher throughput. Multi-threading an algorithm has many advantages, which is why so many technologies and hardware platforms are moving to multi-core processors, but it also has several disadvantages.

1.1 Floating Point Numbers

Floating-point numbers, in this paper, are 32-bit single-precision numbers. All sorting operations are performed on these 4-byte values.

1.2 Sorting Algorithms

Several sorting algorithms are used within this document.
Most of these are more commonly used to illustrate differences in sorting efficiency than in real-world applications. Insertion sort is used to show the performance changes for an algorithm whose average running time is not O(n log n) but O(n^2). Quick sort is an algorithm that Java uses for its own sorting operations, so it is often found in real-world applications and games. It has an average-case performance of O(n log n). Shell sort is a further sorting algorithm with a different average running time, O(n^(3/2)).

All of these algorithms, though, are naturally sequential and single-threaded. They rely on the data to be sorted being in one place and remaining unchanged by anything else while the sort runs. To compare these single-threaded sorting algorithms in a multi-threaded setting, a parallel sorting algorithm is required. In this paper, bucket sort is used for this purpose.

Bucket sort works by first splitting the array of objects to be sorted, in this case floating-point numbers, into buckets according to specified criteria. The aim is for each bucket to contain a similar number of objects. For arbitrary data this may be difficult to achieve. For a completely random set of data, the data can be split into ranges of values. Once the data has been split, another sorting algorithm is run on each of the buckets until it is fully sorted. Since each bucket then holds fully sorted data and the buckets themselves are in order, they can simply be joined together to give the completed sorted array.

Naturally, the initial division of data into buckets costs some performance, so bucket sort does not scale the individual sorting algorithms perfectly. This is the downside of using a parallel sorting algorithm such as bucket sort.
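The bucket sort steps described above (split by value range, sort each bucket on its own thread, join the buckets) can be sketched in Java roughly as follows. This is a minimal illustration and not the code used for the trials: the class and method names are hypothetical, `Arrays.sort` is only a stand-in for whichever per-bucket algorithm is being tested, the lambda syntax assumes Java 8 or later, and the input is assumed to contain no NaN values.

```java
import java.util.Arrays;

class BucketSortSketch {
    /** Maps a value in [min, max] to one of `buckets` equal-width ranges. */
    static int bucketIndex(float value, float min, float max, int buckets) {
        int index = (int) ((value - min) / (max - min) * buckets);
        // Clamp so that value == max (or any out-of-range value) lands in a valid bucket.
        return Math.max(0, Math.min(buckets - 1, index));
    }

    /** Returns a sorted copy of data, sorting each bucket on its own thread. */
    static float[] sort(float[] data, float min, float max, int buckets) {
        // First pass: count how many values fall into each bucket.
        int[] counts = new int[buckets];
        for (float v : data) {
            counts[bucketIndex(v, min, max, buckets)]++;
        }

        // Second pass: copy each value into its bucket. Both passes are linear in n.
        float[][] parts = new float[buckets][];
        for (int b = 0; b < buckets; b++) {
            parts[b] = new float[counts[b]];
        }
        int[] filled = new int[buckets];
        for (float v : data) {
            int b = bucketIndex(v, min, max, buckets);
            parts[b][filled[b]++] = v;
        }

        // Sort every bucket on its own thread.
        Thread[] workers = new Thread[buckets];
        for (int b = 0; b < buckets; b++) {
            final float[] part = parts[b];
            // Stand-in: replace Arrays.sort with the algorithm under test.
            workers[b] = new Thread(() -> Arrays.sort(part));
            workers[b].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait until this bucket is sorted
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sorting", e);
        }

        // The buckets are already in range order, so joining them gives the sorted array.
        float[] sorted = new float[data.length];
        int offset = 0;
        for (float[] part : parts) {
            System.arraycopy(part, 0, sorted, offset, part.length);
            offset += part.length;
        }
        return sorted;
    }

    public static void main(String[] args) {
        float[] percentages = {50f, 3f, 99f, 25f, 75f, 0f, 100f, 26f};
        System.out.println(Arrays.toString(sort(percentages, 0f, 100f, 4)));
    }
}
```

The two counting and copying passes are the extra work that a purely sequential sort does not have to do, which is the overhead referred to in the text.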
Since each of the algorithms incurs the same loss in performance from the initial division into buckets, however, its effect on the comparison between them is negligible.

Figure 1.1 – Bucket Sort of Percentages

Figure 1.1 shows an example of bucket sort on percentages. The data has been split into equal ranges between 0 and 100, and the number of buckets has been set to 4. This might be because the system doing the processing has a quad-core, four-thread processor. Each bucket can then be sorted on its own thread. The circles represent where each bucket's data is sorted. This can be done with any single-threaded or even multi-threaded sorting algorithm, and it is where each of the sorting algorithms in Figure 1.2 is used. Placing the data into buckets takes linear, O(n), time.

Sorting Algorithm   Best Case      Worst Case     Average        Memory Usage
                    Performance    Performance    Performance
Quick Sort          n log n        n^2            n log n        log n
Shell Sort          n              n (log n)^2    n^(3/2)        1
Insertion Sort      n              n^2            n^2            1

Figure 1.2 – Sorting Algorithms

1.3 Insertion Sort

Insertion sort is a very simple sorting algorithm that uses comparisons to sort an array. Its speed depends heavily on the size of the array; larger arrays take disproportionately longer to sort. It is efficient on arrays that are already partially sorted and is very easy to implement.

Insertion sort works by moving along the array from position 0, comparing each element with the elements before it and inserting it into its correct position among them. For example:

[3 4 1 2]  Starting array.
[3 4 1 2]  Start at the first element.
[3 4 1 2]  Move to the next element. It is larger, so no sorting is required.
[3 4 1 2]  Move to the next element. It is smaller, so it is inserted at the correct position.
[1 3 4 2]  Move to the next element. It is also smaller, so it is inserted at the correct position.
[1 2 3 4]  Completed sorted array.

The best case for insertion sort is when the array is already sorted. It then has a running time of O(n).
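The insertion sort walk-through above can be written as a short Java method on a float array. This is a minimal sketch for illustration; the class name is hypothetical and the code is not taken from the trials.

```java
import java.util.Arrays;

class InsertionSortSketch {
    /** Sorts the array in place in ascending order. */
    static void sort(float[] a) {
        for (int i = 1; i < a.length; i++) {
            float key = a[i]; // element to insert into the sorted prefix a[0..i-1]
            int j = i - 1;
            // Shift larger elements one place right to make room for key.
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        float[] data = {3f, 4f, 1f, 2f};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

On an array that is already sorted the inner loop never runs, so each element costs a single comparison; this is where the O(n) best case comes from.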
The worst case for insertion sort is an array that is already sorted in reverse order. This has a running time of O(n^2).

1.4 Quick Sort

Quick sort is said to be the fastest sorting algorithm for random sets of data. This makes it well suited to testing floating-point sorting. Quick sort works as follows:

- A pivot is chosen; in our algorithm, the middle element is used. The pivot can have any value, even one that is not within the array.
- Values in the array that are greater than the pivot value are put to the right of the pivot point, values that are smaller are put to the left, and values that are equal may be left on either side.
- The algorithm then recursively sorts both sides of the pivot.

[3 4 1 2]        Starting array
[3 4 1 2]        Choose pivot point
[2 1 4 3]        Arrange elements
[[2 1] [4 3]]    Recursively quick sort
[[1 2] [3 4]]
[1 2 3 4]        Sorted array

The best case for quick sort is when the partitions are the same size after each step. This is still O(n log n), the same as its average case. The worst case is O(n^2), which occurs, for example, when the last element in the array is selected as the pivot and the array is already sorted.

1.5 Shell Sort

Shell sort is a sorting algorithm that, like the other algorithms discussed in this paper, is comparison based. It is an improvement upon insertion sort that allows elements to be exchanged even if they are not adjacent. A gap size is chosen; in this paper, the initial gap is half the length of the array. The elements that lie a gap apart are treated as a sub-array and sorted. The gap is then reduced; in our case, it is halved. This is repeated until the gap size reaches 1, at which point the algorithm is effectively an insertion sort on an almost completely sorted array.

[3 1 6 5 2 4]  Gap size is 3.
[3 1 6 5 2 4]  Sort the array of [3 5]
[3 1 6 5 2 4]  Sort the array of [1 2]
[3 1 6 5 2 4]  Sort the array of [6 4] = [4 6]
= [3 1 4 5 2 6]  Gap size is 2
[3 1 4 5 2 6]  Sort the array of [3 4 2] = [2 3 4]
[3 1 4 5 2 6]  Sort the array of [1 5 6]
= [2 1 3 5 4 6]  Gap size is 1
[2 1 3 5 4 6]  = [1 2 3 4 5 6]  Finished array

The best case for shell sort is when the elements are already sorted. This is O(n). The worst case for this version of shell sort is the same as for insertion sort: a data set sorted in reverse order. It is also O(n^2).

2. Results

To get a fair comparison of the sorting algorithms when multi-threaded, they need to be compared against their single-threaded processing times. The sorting algorithms being tested are insertion sort, quick sort and shell sort.

The system used to perform all these sorting operations is a 64-bit Windows 7 machine with an Intel Xeon X5677 @ 3.47GHz. This CPU is a hyper-threaded quad core, giving it a total of 8 hardware threads. The machine uses a WD Black 1TB hard drive as its primary drive and has 6 GB of DDR3 RAM. The sorting algorithms are programmed in Java and run on the Java Virtual Machine. This incurs only a minor performance loss in these trials.

Java threads are used to sort the float arrays. Each thread is treated as a bucket in the bucket sort algorithm, and on each thread one of the sorting algorithms listed above is run. The time taken is then compared with that of a sequential, single-threaded form of the algorithm. All times are first measured in nanoseconds for the greatest accuracy and then converted to seconds. The results are presented from 8 threads down to one thread to show the performance changes clearly. The floating-point numbers are generated using a pseudo-random number generator.

Using 8 Threads

[Figure: Sorting Algorithms (8 Threads). Two bar charts of time (seconds) against number of objects for insertion sort, quick sort and shell sort: one for 100 to 10 000 objects, one for 100 000 to 10 000 000 objects.]

With 100 objects, insertion sort is the fastest with 8 threads on the quad-core machine. At 1000 objects, insertion sort already starts to fall behind the other sorting algorithms. It regains ground at 10 000 objects, staying close to shell sort in speed while quick sort is significantly slower. Shell sort begins to show its speed at 10 000 objects and over. Insertion sort then rapidly becomes slower than the other algorithms; at 10 million objects it is unable to sort the float array in under a minute. Quick sort is the fastest algorithm at 10 million objects.

Using 4 Threads

[Figure: Sorting Algorithms (4 Threads). Two bar charts of time (seconds) against number of objects for insertion sort, quick sort and shell sort: one for 100 to 10 000 objects, one for 100 000 to 10 000 000 objects.]

Unlike in the 8-thread results, where insertion sort beats some of the other sorting algorithms at the lower element counts, with 4 threads it comes last in every trial. This is due to the use of 4 threads instead of 8, which severely limits its speed as it is an O(n^2) algorithm. With a million objects, insertion sort takes about 28 seconds compared to a bit over 5 seconds with 8 threads, roughly 5.2 times as long. Quick sort, once again, is the fastest when a large number of elements has to be sorted.

Using 2 Threads

[Figure: Sorting Algorithms (2 Threads). Two bar charts of time (seconds) against number of objects for insertion sort, quick sort and shell sort: one for 100 to 10 000 objects, one for 100 000 to 10 000 000 objects.]

Using just 2 threads, insertion sort suffers even more performance loss. It is unable to sort even a million objects in under a minute. The time insertion sort takes to sort 100 000 objects is comparable to the time shell sort takes to sort 10 million objects, 100 times as many. Shell sort is the fastest with 2 threads at 1000 objects and under. Quick sort then takes the lead from 10 000 elements upwards.

Using 1 Thread

[Figure: Sorting Algorithms (1 Thread). Two bar charts of time (seconds) against number of objects for insertion sort, quick sort and shell sort: one for 100 to 10 000 objects, one for 100 000 to 10 000 000 objects.]

When sorting without bucket sort, each of the algorithms performs much faster at 1000 objects or fewer. At 10 000 objects, insertion sort already shows a massive increase in processing time; with one thread this effect is more pronounced and noticeable at an earlier stage. The time insertion sort takes to sort 100 000 objects is about twice the time shell sort takes to sort a 10 million object array, which makes it around 200 times slower per object. Quick sort is once again the clear victor in terms of speed. Shell sort solidly maintains its second place.

3. Conclusion

Based on the results (viewable in Appendix A), it can be concluded that:

- With around 1000 objects or fewer to be sorted, the overhead of multi-threading makes it inefficient to multi-thread these sorts; a single-threaded sorting algorithm is much faster than a multi-threaded one at this stage.
  o Shell sort and quick sort have similar sorting times at this stage, so their use would be recommended.
    Although insertion sort is slower than these two, it is not significantly slower at this stage.
- At numbers greater than 1000, multi-threading starts to have an effect.
  o Insertion sort is by far the slowest of these three sorting algorithms, with times many times greater than those of the other two.
  o Even though insertion sort gains significant speed every time extra threads are added, this is still not enough for it to compete with the other, more efficient sorting algorithms.
  o Single-threaded sorting is still mostly faster, as the processing overhead increases with each additional thread.
- Once there are 100 000 elements or more in the floating-point array, the power of multi-threading really starts to show.
  o Insertion sort, from 1 thread to 2 threads, runs about 4 times as fast. Quick sort and shell sort also improve, by roughly 35 to 60%.
  o At 4 threads, insertion sort runs about 14.7 times as fast as the single-threaded variant. At 8 threads, it runs about 48 times as fast.
  o With 4 threads, quick sort runs about 1.88 times as fast as with one thread, and shell sort about 2.03 times as fast.
  o At 8 threads, the performance of quick sort and shell sort actually decreases compared with 4 threads.
- With 1 000 000 elements, a single-threaded insertion sort is unable to sort the array in under a minute.
  o At 8 threads, both quick sort and shell sort lose performance compared to their 4-thread counterparts. Quick sort suffers most heavily, becoming even slower than its single-threaded version.
- At the extremely high count of 10 000 000:
  o Quick sort at 4 threads is the fastest overall, taking an average of a bit over half a second to sort the array.
  o The fastest form of shell sort is also with 4 threads, taking about a third longer than quick sort.

It seems that 4 threads is the ideal count for multi-threading sorting algorithms on the CPU tested. This gave the best performance based on the results.
The trend seems to be that more threads are better for more elements. This means that for arrays larger than the 10 000 000 elements tested, it may be more efficient to increase the number of threads. It also means that insertion sort may even be a viable solution given enough threads, although this would bring a great deal of overhead from the large number of threads. The improvement may be even more pronounced if implemented on a GPU, as GPUs can have hundreds of cores.

4. Bibliography

Sorting Algorithms Compared - Cprogramming.com. 2012. Available at: http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html. [Accessed 21 May 2012].

What different sorting algorithms sound like - YouTube. 2012. Available at: http://www.youtube.com/watch?feature=player_embedded&v=t8g-iYGHpEA. [Accessed 1 June 2012].

Bucket and radix sorting. 2012. Available at: http://htmltolatex.sourceforge.net/samples/sample4.html. [Accessed 21 May 2012].

5. Appendix A

Floating point numbers compared with Integers

Sorting floating-point numbers may at first seem like a simple task, but considerably more processing is required to sort a set of floating-point numbers than a set of integers. Because each number carries up to 32 bits of information, comparisons take a significant amount of time and logical processing. This also means that the algorithm chosen to perform this task needs to be suitable and efficient.

Results in graph form

These results are reduced to 5 decimal places to ensure clarity. The original results were taken with an accuracy of 9 decimal places.
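As an illustration of how a nanosecond measurement becomes one of the 5-decimal second values listed in this appendix, the conversion might look like the sketch below. The class and method names are hypothetical, `Arrays.sort` is only a stand-in for the algorithm being timed, and rounding to the nearest value is an assumption: the report does not say whether the figures were rounded or truncated.

```java
import java.util.Arrays;

class TimingSketch {
    /** Converts elapsed nanoseconds to seconds, rounded to 5 decimal places. */
    static double toSeconds(long nanos) {
        double seconds = nanos / 1_000_000_000.0; // 9 decimal places of precision
        return Math.round(seconds * 100_000.0) / 100_000.0;
    }

    /** Times one sort of the given array and returns the elapsed seconds. */
    static double timeSort(float[] data) {
        long start = System.nanoTime();
        Arrays.sort(data); // stand-in for the sorting algorithm under test
        return toSeconds(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        float[] data = new float[100_000];
        java.util.Random random = new java.util.Random(42); // fixed seed, for repeatability
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextFloat();
        }
        System.out.println(timeSort(data) + " seconds");
    }
}
```

`System.nanoTime()` is used because it measures elapsed time rather than wall-clock time, so it is unaffected by changes to the system clock during a run.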
All times are in seconds. A dash marks a run for which no time was recorded.

8 Threads
No. of Objects    Insertion Sort    Quick Sort    Shell Sort
100               0.00145           0.00152       0.00180
1000              0.00232           0.00168       0.00162
10000             0.00734           0.01124       0.00669
100000            0.09247           0.07493       0.03030
1000000           5.40445           0.16500       0.11285
10000000          -                 0.63655       0.78513

4 Threads
No. of Objects    Insertion Sort    Quick Sort    Shell Sort
100               0.00006           0.00004       0.00003
1000              0.00169           0.00126       0.00113
10000             0.00592           0.00266       0.00361
100000            0.30377           0.00629       0.00824
1000000           28.11392          0.05577       0.08509
10000000          -                 0.57307       0.76169

2 Threads
No. of Objects    Insertion Sort    Quick Sort    Shell Sort
100               0.00101           0.00090       0.00084
1000              0.00197           0.00146       0.00121
10000             0.01458           0.00242       0.00347
100000            1.13258           0.00866       0.01049
1000000           -                 0.08068       0.11192
10000000          -                 0.82542       1.23144

1 Thread
No. of Objects    Insertion Sort    Quick Sort    Shell Sort
100               0.00005           0.00003       0.00003
1000              0.00097           0.00032       0.00042
10000             0.04535           0.00131       0.00176
100000            4.46106           0.01181       0.01673
1000000           -                 0.12513       0.19101
10000000          -                 1.42635       2.32085
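The speed-up factors discussed in the conclusion can be recomputed from the tables above by dividing the single-threaded time by the multi-threaded time. The sketch below does this for the 100 000 object row; the class and method names are hypothetical, and the times are copied from this appendix.

```java
class SpeedUpSketch {
    /** Speed-up factor: how many times faster the multi-threaded run is. */
    static double speedUp(double singleThreadSeconds, double multiThreadSeconds) {
        return singleThreadSeconds / multiThreadSeconds;
    }

    public static void main(String[] args) {
        String[] names = {"Insertion sort", "Quick sort", "Shell sort"};
        int[] threads = {2, 4, 8};
        // Seconds to sort 100 000 objects with 1, 2, 4 and 8 threads (Appendix A).
        double[][] seconds = {
            {4.46106, 1.13258, 0.30377, 0.09247},
            {0.01181, 0.00866, 0.00629, 0.07493},
            {0.01673, 0.01049, 0.00824, 0.03030},
        };
        for (int a = 0; a < names.length; a++) {
            for (int t = 0; t < threads.length; t++) {
                double factor = speedUp(seconds[a][0], seconds[a][t + 1]);
                System.out.printf("%s, %d threads: %.2f times the single-threaded speed%n",
                        names[a], threads[t], factor);
            }
        }
    }
}
```

A factor below 1 means the multi-threaded run was slower than the single-threaded one, as happens for quick sort and shell sort at 8 threads with this array size.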