UNIT V  SEARCHING, SORTING AND HASHING TECHNIQUES                                   9
Searching – Linear Search – Binary Search. Sorting – Bubble sort – Selection sort – Insertion sort – Shell sort – Merge Sort. Hashing – Hash Functions – Separate Chaining – Open Addressing – Rehashing – Extendible Hashing.

SEARCHING
Searching is the process of finding a particular element in a list. If the element is present in the list, the search is called successful and the process returns the location of that element; otherwise, the search is called unsuccessful.
Based on the type of search operation, searching algorithms are generally classified into two categories:
1. Sequential Search: The list or array is traversed sequentially and every element is checked. Example: Linear Search.
2. Interval Search: These algorithms are designed for searching in sorted data structures. They are much more efficient than linear search because they repeatedly target the centre of the search structure and divide the search space in half. Example: Binary Search.

LINEAR SEARCH
Linear search, also called sequential search, is a very simple method used for searching an array for a particular value. It works by comparing the value to be searched with every element of the array, one by one, in sequence until a match is found. If a match is found, the location of the item is returned; otherwise, the algorithm returns NULL (or an invalid index such as -1).
Linear search is mostly used to search an unordered list of elements (an array in which the data elements are not sorted). The worst-case time complexity of linear search is O(n).
The steps used in the implementation of linear search are listed as follows:
o First, traverse the array elements using a for loop.
o In each iteration of the loop, compare the search element with the current array element, and
  o If the element matches, return the index of the corresponding array element.
  o If the element does not match, move to the next element.
o If there is no match, i.e., the search element is not present in the given array, return -1.

ALGORITHM
Linear_Search(a, n, val)   // 'a' is the given array, 'n' is the size of the array, 'val' is the value to search
Step 1: set pos = -1
Step 2: set i = 1
Step 3: repeat step 4 while i <= n
Step 4:     if a[i] == val
                set pos = i
                print pos
                go to step 6
            [end of if]
            set i = i + 1
        [end of loop]
Step 5: if pos = -1
            print "value is not present in the array"
        [end of if]
Step 6: exit

In Steps 1 and 2 of the algorithm, we initialize the values of POS and I. In Step 3, a while loop is executed that runs as long as I is less than or equal to N (the total number of elements in the array). In Step 4, a check is made to see whether a match is found between the current array element and VAL. If a match is found, the position of the array element is printed; otherwise the value of I is incremented to compare the next element with VAL. However, if all the array elements have been compared with VAL and no match is found, it means that VAL is not present in the array.

WORKING OF LINEAR SEARCH
Now, let's see the working of the linear search algorithm. To understand it, let's take an unsorted array. Let the elements of the array be as shown in the figure, and let the element to be searched be K = 41.
Now, start from the first element and compare K with each element of the array. The value of K, i.e., 41, does not match the first element of the array.
So, move to the next element and follow the same process until the respective element is found. Once the element to be searched is found, the algorithm returns the index of the matched element.

LINEAR SEARCH COMPLEXITY
1. Time Complexity
o Best Case Complexity - In linear search, the best case occurs when the element we are finding is at the first position of the array. The best-case time complexity of linear search is O(1).
o Average Case Complexity - The average-case time complexity of linear search is O(n).
o Worst Case Complexity - In linear search, the worst case occurs when the element we are looking for is present at the end of the array, or when the target element is not present in the array at all and we have to traverse the entire array. The worst-case time complexity of linear search is O(n).
The time complexity of linear search is O(n) because every element in the array is compared only once.
2. Space Complexity
o The space complexity of linear search is O(1).

PROGRAM
#include<stdio.h>
#include<conio.h>
void main(){
    int list[20], size, i, sElement;

    printf("Enter size of the list: ");
    scanf("%d", &size);
    printf("Enter any %d integer values: ", size);
    for(i = 0; i < size; i++)
        scanf("%d", &list[i]);
    printf("Enter the element to be searched: ");
    scanf("%d", &sElement);

    // Linear Search Logic
    for(i = 0; i < size; i++)
    {
        if(sElement == list[i])
        {
            printf("Element is found at %d index", i);
            break;
        }
    }
    if(i == size)
        printf("Given element is not found in the list!!!");
    getch();
}

Output
Enter size of the list: 3
Enter any 3 integer values: 35 98 12
Enter the element to be searched: 12
Element is found at 2 index

Linear Search Applications
1. For searching operations in smaller arrays (<100 items).

BINARY SEARCH
The binary search algorithm finds a given element in a list of elements with O(log n) time complexity, where n is the total number of elements in the list. Binary search can be used only with a sorted list of elements, that is, only with a list of elements that are already arranged in order. It cannot be used for a list of elements arranged in random order.
The search process starts by comparing the search element with the middle element in the list. If both match, the result is "element found". Otherwise, we check whether the search element is smaller or larger than the middle element. If the search element is smaller, we repeat the same process for the left sublist of the middle element. If the search element is larger, we repeat the same process for the right sublist of the middle element. We repeat this process until we find the search element in the list or until we are left with a sublist of only one element. If that element also does not match the search element, the result is "Element not found in the list".

ALGORITHM
Step 1 - Read the search element from the user.
Step 2 - Find the middle element in the sorted list.
Step 3 - Compare the search element with the middle element in the sorted list.
Step 4 - If both match, then display "Given element is found!!!" and terminate the function.
Step 5 - If both do not match, then check whether the search element is smaller or larger than the middle element.
Step 6 - If the search element is smaller than the middle element, repeat steps 2, 3, 4 and 5 for the left sublist of the middle element.
Step 7 - If the search element is larger than the middle element, repeat steps 2, 3, 4 and 5 for the right sublist of the middle element.
Step 8 - Repeat the same process until we find the search element in the list or until the sublist contains only one element.
Step 9 - If that element also does not match the search element, then display "Element is not found in the list!!!" and terminate the function.
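These steps can be implemented either iteratively or recursively (both methods are discussed below, and a complete iterative C program is given later in this section). The following is only a minimal recursive sketch; the function name binarySearch and the sample array are illustrative assumptions, not part of the prescribed program.

/* A minimal recursive sketch of the steps above (illustrative only).
   The array 'a' must already be sorted in ascending order.
   Returns the index of 'val' in a[low..high], or -1 if it is not present. */
#include <stdio.h>

int binarySearch(int a[], int low, int high, int val)
{
    if (low > high)                 /* sublist is empty: element not found   */
        return -1;
    int mid = (low + high) / 2;     /* middle element of the current sublist */
    if (a[mid] == val)
        return mid;                 /* match found                           */
    else if (val < a[mid])
        return binarySearch(a, low, mid - 1, val);   /* search left sublist  */
    else
        return binarySearch(a, mid + 1, high, val);  /* search right sublist */
}

int main(void)
{
    int a[] = {1, 3, 5, 7, 9};
    int pos = binarySearch(a, 0, 4, 7);
    if (pos == -1)
        printf("Element is not found in the list!!!\n");
    else
        printf("Element found at index %d.\n", pos);
    return 0;
}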
Working of Binary Search
Now, let's see the working of the binary search algorithm. To understand it, let's take a sorted array. There are two methods to implement the binary search algorithm:
o Iterative method
o Recursive method
The recursive method of binary search follows the divide and conquer approach.
Let the elements of the array be as shown in the figure, and let the element to be searched be K = 56.
We use the following formula to calculate the mid of the array:
    mid = (beg + end)/2
So, in the given array, beg = 0, end = 8 and mid = (0 + 8)/2 = 4; so 4 is the mid index of the array. The search then continues on the left or right half until the element to be searched is found, and the algorithm returns the index of the matched element.

BINARY SEARCH COMPLEXITY
1. Time Complexity
o Best Case Complexity - In binary search, the best case occurs when the element to be searched is found in the first comparison, i.e., when the first middle element itself is the element to be searched. The best-case time complexity of binary search is O(1).
o Average Case Complexity - The average-case time complexity of binary search is O(log n).
o Worst Case Complexity - In binary search, the worst case occurs when we have to keep reducing the search space until it has only one element. The worst-case time complexity of binary search is O(log n).
2. Space Complexity
o The space complexity of binary search is O(1).

PROGRAM
#include<stdio.h>
#include<conio.h>
void main()
{
    int first, last, middle, size, i, sElement, list[100];
    clrscr();

    printf("Enter the size of the list: ");
    scanf("%d", &size);
    printf("Enter %d integer values in ascending order\n", size);
    for (i = 0; i < size; i++)
        scanf("%d", &list[i]);
    printf("Enter value to be searched: ");
    scanf("%d", &sElement);

    first = 0;
    last = size - 1;
    middle = (first + last)/2;
    while (first <= last)
    {
        if (list[middle] < sElement)
            first = middle + 1;
        else if (list[middle] == sElement)
        {
            printf("Element found at index %d.\n", middle);
            break;
        }
        else
            last = middle - 1;
        middle = (first + last)/2;
    }
    if (first > last)
        printf("Element Not found in the list.");

    getch();
}

OUTPUT
Enter the size of the list: 5
Enter 5 integer values in ascending order
1 3 5 7 9
Enter value to be searched: 3
Element found at index 1.

Binary Search Applications
In the libraries of Java, .NET and the C++ STL.
While debugging, binary search is used to pinpoint the place where an error happens.

SORTING
Sorting algorithms are methods of reorganizing a large number of items into some specific order, such as highest to lowest, or vice versa, or even in some alphabetical order. These algorithms take an input list, process it (i.e., perform some operations on it) and produce the sorted list.

EXAMPLES OF SORTING IN REAL-LIFE SCENARIOS:
• Telephone Directory
• Dictionary

TYPES:
• Stable sorting
• Not stable sorting

STABLE SORTING: If a sorting algorithm, after sorting the contents, does not change the sequence in which similar (equal) elements appear, it is called stable sorting.
NOT STABLE SORTING: If a sorting algorithm, after sorting the contents, changes the sequence in which similar (equal) elements appear, it is called unstable sorting.
EXAMPLE FOR STABLE & UNSTABLE: (see figure)

TYPES OF SORTING:
i. Internal sorting
ii. External sorting
Internal Sorting: If all the data that is to be sorted can be accommodated at a time in the main memory, an internal sorting method is performed.
External Sorting: When the data that is to be sorted cannot be accommodated in the main memory at the same time and some of it has to be kept in auxiliary memory such as a hard disk, floppy disk, magnetic tape, etc., then external sorting methods are performed.

Complexity of Sorting Algorithms
The complexity of a sorting algorithm measures the running time of a function in which 'n' items are to be sorted. The most noteworthy considerations are:
• The length of time spent by the programmer in programming a specific sorting program
• The amount of machine time necessary for running the program
• The amount of memory necessary for running the program

The Efficiency of Sorting Techniques
● To get the amount of time required to sort an array of 'n' elements by a particular method, the normal approach is to analyse the method to find the number of comparisons (or exchanges) required by it.
● Most of the sorting techniques are data sensitive, so their metrics depend on the order in which the elements appear in the input array.
● Various sorting techniques are analysed in the best case, worst case and average case.

TYPES OF SORTING ALGORITHM
1. Quick Sort
2. Bubble Sort
3. Merge Sort
4. Insertion Sort
5. Selection Sort
6. Heap Sort
7. Radix Sort
8. Bucket Sort

Time Complexities of Sorting Algorithms:
Algorithm        Best          Average       Worst
Quick Sort       Ω(n log(n))   Θ(n log(n))   O(n^2)
Bubble Sort      Ω(n)          Θ(n^2)        O(n^2)
Merge Sort       Ω(n log(n))   Θ(n log(n))   O(n log(n))
Insertion Sort   Ω(n)          Θ(n^2)        O(n^2)
Selection Sort   Ω(n^2)        Θ(n^2)        O(n^2)
Heap Sort        Ω(n log(n))   Θ(n log(n))   O(n log(n))
Radix Sort       Ω(nk)         Θ(nk)         O(nk)
Bucket Sort      Ω(n+k)        Θ(n+k)        O(n^2)

BUBBLE SORT
Bubble sort is a simple sorting algorithm. It is a comparison-based algorithm in which each pair of adjacent elements is compared and the elements are swapped if they are not in order. This algorithm is not suitable for large data sets as its average and worst case complexity are O(n^2), where n is the number of items.

PROGRAM
#include <stdio.h>

void bubble_sort(long [], long);

int main()
{
    long array[100], n, c;

    printf("Enter Elements\n");
    scanf("%ld", &n);
    printf("Enter %ld integers\n", n);
    for (c = 0; c < n; c++)
        scanf("%ld", &array[c]);

    bubble_sort(array, n);

    printf("Sorted list in ascending order:\n");
    for (c = 0; c < n; c++)
        printf("%ld\n", array[c]);

    return 0;
}

void bubble_sort(long list[], long n)
{
    long c, d, t;

    for (c = 0; c < (n - 1); c++)
    {
        for (d = 0; d < n - c - 1; d++)
        {
            if (list[d] > list[d+1])
            {
                /* Swapping */
                t = list[d];
                list[d] = list[d+1];
                list[d+1] = t;
            }
        }
    }
}

OUTPUT
Enter Elements
6
Enter 6 integers
15 95 45 65 72 25
Sorted list in ascending order:
15
25
45
65
72
95

SELECTION SORT
Selection sort is a simple sorting algorithm which finds the smallest element in the array and exchanges it with the element in the first position. It then finds the second smallest element and exchanges it with the element in the second position, and continues in this way until the entire array is sorted.
Following is a pictorial depiction of the entire sorting process (see figure).

Program for Selection Sort
#include <stdio.h>

int main()
{
    int array[100], n, c, d, position, swap;

    printf("Enter number of elements\n");
    scanf("%d", &n);
    printf("Enter %d integers\n", n);
    for (c = 0; c < n; c++)
        scanf("%d", &array[c]);

    for (c = 0; c < (n - 1); c++)
    {
        position = c;
        for (d = c + 1; d < n; d++)
        {
            if (array[position] > array[d])
                position = d;
        }
        if (position != c)
        {
            swap = array[c];
            array[c] = array[position];
            array[position] = swap;
        }
    }

    printf("Sorted list in ascending order:\n");
    for (c = 0; c < n; c++)
        printf("%d\n", array[c]);

    return 0;
}

OUTPUT
Enter number of elements
5
Enter 5 integers
25 62 8 52 11
Sorted list in ascending order:
8
11
25
52
62

INSERTION SORT
What is Insertion Sort Algorithm?
Insertion sort is slightly different from the other sorting algorithms. It is based on the idea that in each iteration one element of the array is consumed and placed in its right position in the sorted part of the array, so that the entire array is sorted at the end of all the iterations. In other words, it compares the current element with the elements on its left-hand side (the sorted part). If the current element is greater than all the elements on its left-hand side, it leaves the element in its place and moves on to the next element. Otherwise, it finds the correct position and moves the element to that position by shifting all the elements in the sorted part that are larger than the current element one position ahead.
The diagram (see figure) represents how insertion sort works. Insertion sort works the way we sort playing cards in our hands. It always starts with the second element as the key. The key is compared with the elements before it and is put in the right place. In the figure, 40 has nothing before it. Element 10 is compared with 40 and is inserted before 40. Element 9 is smaller than 40 and 10, so it is inserted before 10, and this operation continues until the array is sorted in ascending order.

Program
#include <stdio.h>

int main()
{
    int n, array[1000], c, d, t;

    printf("Enter number of elements\n");
    scanf("%d", &n);
    printf("Enter %d integers\n", n);
    for (c = 0; c < n; c++) {
        scanf("%d", &array[c]);
    }

    for (c = 1; c <= n - 1; c++) {
        d = c;
        while (d > 0 && array[d] < array[d-1]) {
            t = array[d];
            array[d] = array[d-1];
            array[d-1] = t;
            d--;
        }
    }

    printf("Sorted list in ascending order:\n");
    for (c = 0; c <= n - 1; c++) {
        printf("%d\n", array[c]);
    }

    return 0;
}

Example 2: (see figure)

SHELL SORT
What is Shell Sort Algorithm?
Shellsort, also known as Shell sort or Shell's method, is an in-place comparison sort. It can be seen either as a generalization of sorting by exchange (bubble sort) or of sorting by insertion (insertion sort). Its worst-case time complexity is O(n^2) and its best-case complexity is O(n log(n)).
The Shell sort algorithm is very similar to the insertion sort algorithm. In insertion sort, we move elements one position ahead to insert an element at its correct position. Shell sort instead starts by sorting pairs of elements far apart from each other, then progressively reduces the gap between the elements to be compared. By starting with far-apart elements, it can move some out-of-place elements into position faster than a simple nearest-neighbour exchange.
Shelling out the Shell Sort Algorithm with Examples
Example 1: Here is an example to help you understand the working of Shell sort on an array of elements A = {17, 3, 9, 1, 8} (see figure).
Example 2: (see figure)

Program
#include<stdio.h>

int main()
{
    int n, i, j, temp, gap;

    scanf("%d", &n);
    int arr[n];
    for(i = 0; i < n; i++) {
        scanf("%d", &arr[i]);
    }

    for (gap = n/2; gap > 0; gap = gap / 2)
    {
        // Do a gapped insertion sort.
        // The first gap elements arr[0..gap-1] are already in gapped order;
        // keep adding one more element until the entire array is gap sorted.
        for (i = gap; i < n; i = i + 1)
        {
            // add arr[i] to the elements that have been gap sorted:
            // save arr[i] in temp and make an empty space at index i
            temp = arr[i];
            // shift earlier gap-sorted elements up until the correct location for arr[i] is found
            for (j = i; j >= gap && arr[j - gap] > temp; j = j - gap)
                arr[j] = arr[j - gap];
            // put temp (the original arr[i]) in its correct position
            arr[j] = temp;
        }
    }

    for(i = 0; i < n; i++) {
        printf("%d ", arr[i]);
    }
    return 0;
}

OUTPUT
n = 5
Input : 1 6 45 12 20
Output: 1 6 12 20 45

MERGE SORT
Merge sort is a divide-and-conquer algorithm based on the idea of breaking down a list into several sub-lists until each sublist consists of a single element, and then merging those sublists in a manner that results in a sorted list.
Idea:
Divide the unsorted list into N sublists, each containing one element.
Take adjacent pairs of two singleton lists and merge them to form a list of 2 elements; the N sublists now become N/2 lists of size 2.
Repeat the process until a single sorted list is obtained.
While comparing two sublists for merging, the first element of both lists is taken into consideration. When sorting in ascending order, the element that has the smaller value becomes the next element of the sorted list. This procedure is repeated until both the smaller sublists are empty and the new combined sublist contains all the elements of both sublists.

DIVIDE AND CONQUER STRATEGY
Using the divide-and-conquer technique, we divide a problem into subproblems. When the solution to each subproblem is ready, we 'combine' the results from the subproblems to solve the main problem.
Suppose we had to sort an array A. A subproblem would be to sort a sub-section of this array starting at index p and ending at index r, denoted as A[p..r].
Divide: If q is the half-way point between p and r, then we can split the subarray A[p..r] into two arrays A[p..q] and A[q+1..r].
Conquer: In the conquer step, we try to sort both the subarrays A[p..q] and A[q+1..r]. If we haven't yet reached the base case, we again divide both these subarrays and try to sort them.
Combine: When the conquer step reaches the base case and we get two sorted subarrays A[p..q] and A[q+1..r] for the array A[p..r], we combine the results by creating a sorted array A[p..r] from the two sorted subarrays A[p..q] and A[q+1..r].

The Merge Step of Merge Sort
Every recursive algorithm depends on a base case and the ability to combine the results from base cases. Merge sort is no different. The most important part of the merge sort algorithm is, you guessed it, the merge step. The merge step is the solution to the simple problem of merging two sorted lists (arrays) to build one large sorted list (array). The algorithm maintains three pointers, one for each of the two arrays and one for maintaining the current index of the final sorted array.
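The divide and combine steps described above can be written as a short recursive routine. The sketch below is illustrative only: it assumes a merge(arr, p, q, r) function that merges the two sorted subarrays A[p..q] and A[q+1..r], exactly as developed step by step in the merge-step discussion that follows.

/* A minimal sketch of the recursive merge sort driver (illustrative only). */
void merge(int arr[], int p, int q, int r);   /* defined later in this section */

void mergeSort(int arr[], int p, int r)
{
    if (p < r)                       /* base case: one element is already sorted */
    {
        int q = (p + r) / 2;         /* Divide: half-way point between p and r   */
        mergeSort(arr, p, q);        /* Conquer: sort the left subarray           */
        mergeSort(arr, q + 1, r);    /* Conquer: sort the right subarray          */
        merge(arr, p, q, r);         /* Combine: merge the two sorted subarrays   */
    }
}
/* Usage (assumed): for an array A of n elements, call mergeSort(A, 0, n - 1). */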
Step 1: Create duplicate copies of the sub-arrays to be sorted
// Create L ← A[p..q] and M ← A[q+1..r]
// For the example shown, p = 0, q = 3, r = 5:
int n1 = q - p + 1;    // = 3 - 0 + 1 = 4
int n2 = r - q;        // = 5 - 3 = 2

int L[n1], M[n2];

for (int i = 0; i < n1; i++)
    L[i] = arr[p + i];         // L[0..3] = A[0..3] = [1, 5, 10, 12]
for (int j = 0; j < n2; j++)
    M[j] = arr[q + 1 + j];     // M[0..1] = A[4..5] = [6, 9]

Step 2: Maintain the current index of the sub-arrays and the main array
int i, j, k;
i = 0;
j = 0;
k = p;

Step 3: Until we reach the end of either L or M, pick the smaller of the current elements of L and M and place it in the correct position in A[p..r]
while (i < n1 && j < n2) {
    if (L[i] <= M[j]) {
        arr[k] = L[i];
        i++;
    } else {
        arr[k] = M[j];
        j++;
    }
    k++;
}
Individual elements of the sorted subarrays are compared until we reach the end of one of them.

Step 4: When we run out of elements in either L or M, pick up the remaining elements and put them in A[p..r]
// We exited the earlier loop because j < n2 doesn't hold
while (i < n1) {
    arr[k] = L[i];
    i++;
    k++;
}
Copy the remaining elements from the first array to the main subarray.

// We exited the earlier loop because i < n1 doesn't hold
while (j < n2) {
    arr[k] = M[j];
    j++;
    k++;
}
This step would have been needed if the size of M were greater than that of L.
At the end of the merge function, the subarray A[p..r] is sorted.

PROGRAM (the merge function, Steps 1-4 combined)
// Merge the two sorted subarrays arr[p..q] and arr[q+1..r]
void merge(int arr[], int p, int q, int r)
{
    // Create duplicate copies L ← A[p..q] and M ← A[q+1..r]
    int n1 = q - p + 1;
    int n2 = r - q;
    int L[n1], M[n2];
    for (int x = 0; x < n1; x++)
        L[x] = arr[p + x];
    for (int y = 0; y < n2; y++)
        M[y] = arr[q + 1 + y];

    // Maintain current index of sub-arrays and main array
    int i, j, k;
    i = 0;
    j = 0;
    k = p;

    // Until we reach the end of either L or M, pick the smaller among the
    // elements of L and M and place it in the correct position in A[p..r]
    while (i < n1 && j < n2) {
        if (L[i] <= M[j]) {
            arr[k] = L[i];
            i++;
        } else {
            arr[k] = M[j];
            j++;
        }
        k++;
    }

    // When we run out of elements in either L or M,
    // pick up the remaining elements and put them in A[p..r]
    while (i < n1) {
        arr[k] = L[i];
        i++;
        k++;
    }
    while (j < n2) {
        arr[k] = M[j];
        j++;
        k++;
    }
}

Merge Sort Complexity
Time Complexity
Best    - O(n*log n)
Worst   - O(n*log n)
Average - O(n*log n)
Space Complexity - O(n)
Stability - Yes

Merge Sort Applications
Inversion count problem
External sorting
E-commerce applications

From the image above, at each step a list of size M is divided into 2 sublists of size M/2, until no further division can be done. To understand this better, consider a smaller array A containing the elements (9, 7, 8).
At the first step this list of size 3 is divided into 2 sublists, the first consisting of the elements (9, 7) and the second being (8). Now the first list, consisting of the elements (9, 7), is further divided into 2 sublists consisting of the elements (9) and (7) respectively. As no further breakdown of this list can be done, since each sublist consists of at most one element, we now start to merge these lists.
The 2 sublists formed in the last step are then merged together in sorted order using the procedure mentioned above, leading to a new list (7, 9). Backtracking further, we then need to merge the list consisting of the element (8) with this list, leading to the new sorted list (7, 8, 9).

HASHING
Hashing is a technique that is used to store, retrieve and find data in the data structure called a hash table. It is used to overcome the drawbacks of linear search (many comparisons) and binary search (requires a sorted list). It involves two important concepts:
Hash Table
Hash Function

Hash table
A hash table is a data structure that is used to store and retrieve data (keys) very quickly. It is an array of some fixed size, containing the keys. Hash table indices run from 0 to Tablesize – 1. Each key is mapped into some number in the range 0 to Tablesize – 1. This mapping is called the hash function.
Insertion of data into the hash table is based on the key value obtained from the hash function. Using the same hash key value, the data can be retrieved from the hash table with a few hash key comparisons.
The load factor of a hash table is calculated using the formula:
    Load factor = (Number of data elements in the hash table) / (Size of the hash table)

Factors affecting Hash Table Design
    Hash function
    Table size
    Collision handling scheme
(A simple hash table with table size = 10 has index positions 0 to 9.)

Hash function: It is a function which distributes the keys evenly among the cells in the hash table. Using the same hash function we can retrieve data from the hash table. The hash function is used to implement the hash table. The integer value returned by the hash function is called the hash key.
If the input keys are integers, the commonly used hash function is
    H(key) = key % Tablesize

A simple hash function
typedef unsigned int index;

index Hash(const char *key, int Tablesize)
{
    unsigned int Hashval = 0;
    while (*key != '\0')
        Hashval += *key++;
    return (Hashval % Tablesize);
}

Types of Hash Functions
1. Division Method
2. Mid Square Method
3. Multiplicative Hash Function
4. Digit Folding

1. Division Method: It depends on the remainder of division. The divisor is the table size. The formula is H(key) = key % table size.
E.g. consider the following data or record keys (36, 18, 72, 43, 6) with table size = 8. The hash keys are 36 % 8 = 4, 18 % 8 = 2, 72 % 8 = 0, 43 % 8 = 3 and 6 % 8 = 6.

2. Mid Square Method: We first square the item, and then extract some portion of the resulting digits. For example, if the item were 44, we would first compute 44^2 = 1,936. Extracting the middle two digits gives 93, so the key 44 is stored at index 93.

3. Multiplicative Hash Function: The key is multiplied by some constant value. The hash function is given by
    H(key) = Floor(P * (key * A))
where P is an integer constant (e.g. P = 50) and A is a constant real number (A = 0.61803398987, suggested by Donald Knuth).
E.g. for key 107:
    H(107) = Floor(50 * (107 * 0.61803398987)) = Floor(3306.481845) = 3306

4. Digit Folding Method: The folding method for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash key value. For example, if our item was the phone number 436-555-4601, we would take the digits and divide them into groups of 2 (43, 65, 55, 46, 01). After the addition, 43+65+55+46+01, we get 210. If we assume our hash table has 11 slots, then we need to perform the extra step of dividing by 11 and keeping the remainder. In this case 210 % 11 is 1, so the phone number 436-555-4601 hashes to slot 1.
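The division and digit folding methods described above can be sketched in C as follows. This is only an illustrative sketch; the function names divisionHash and foldingHash and the two-digit grouping are assumptions for demonstration, not part of the prescribed notes.

#include <stdio.h>

/* Division method: H(key) = key % table size */
int divisionHash(int key, int tableSize)
{
    return key % tableSize;
}

/* Digit folding method (sketch): split the key into groups of two digits,
   add the groups, then take the remainder with the table size. */
int foldingHash(long long key, int tableSize)
{
    long long sum = 0;
    while (key > 0) {
        sum += key % 100;   /* take the last two digits as one piece */
        key /= 100;
    }
    return (int)(sum % tableSize);
}

int main(void)
{
    printf("%d\n", divisionHash(36, 8));            /* prints 4                      */
    printf("%d\n", foldingHash(4365554601LL, 11));  /* phone number example: prints 1 */
    return 0;
}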
Collision
If two or more keys hash to the same index, the corresponding records cannot be stored in the same location. This condition is known as a collision.

Characteristics of a Good Hashing Function:
● It should be simple to compute.
● The number of collisions should be small while placing records in the hash table.
● A hash function with no collisions is a perfect hash function.
● The hash function should produce keys which are distributed uniformly in the hash table.
● The hash function should depend upon every bit of the key. Thus a hash function that simply extracts a portion of the key is not suitable.

Collision Resolution Strategies / Techniques (CRT)
Obviously, two records cannot be stored in the same location. Therefore, a method used to solve the problem of collision, also called a collision resolution technique, is applied.
The two most popular methods of resolving collisions are:
● Separate chaining (Open Hashing)
● Open addressing (Closed Hashing)
    1. Linear Probing
    2. Quadratic Probing
    3. Double Hashing

SEPARATE CHAINING
In chaining, each location in a hash table stores a pointer to a linked list that contains all the key values that were hashed to that location. That is, location l in the hash table points to the head of the linked list of all the key values that hashed to l. However, if no key value hashes to l, then location l in the hash table contains NULL.
The figure shows how the key values are mapped to a location in the hash table and stored in a linked list that corresponds to that location.
Searching for a value in a chained hash table is as simple as scanning a linked list for an entry with the given key. The insertion operation appends the key to the linked list pointed to by the hashed location. Deleting a key requires searching the list and removing the element.
Chained hash tables with linked lists are widely used due to the simplicity of the algorithms to insert, delete, and search a key. The code for these algorithms is exactly the same as that for inserting, deleting, and searching a value in a singly linked list.
While the cost of inserting a key in a chained hash table is O(1), the cost of deleting and searching a value is O(m), where m is the number of elements in the list at that location. Searching and deleting take more time because these operations scan the entries of the selected location for the desired key. In the worst case, searching a value may take a running time of O(n), where n is the number of key values stored in the chained hash table. This case arises when all the key values are inserted into the linked list of the same location of the hash table; in this case, the hash table is ineffective.

Codes to initialize, insert, delete, and search a value in a chained hash table
(Here h(val) denotes the hash function applied to the value.)

Structure of the node
typedef struct node_HT
{
    int value;
    struct node_HT *next;
} node;

Code to initialize a chained hash table
/* Initializes the m locations of the chained hash table.
   The operation takes a running time of O(m). */
void initializeHashTable(node *hash_table[], int m)
{
    int i;
    for (i = 0; i < m; i++)
        hash_table[i] = NULL;
}

Code to search a value
/* The element is searched in the linked list whose head pointer is stored at
   the location given by h(k). If the search is successful, the function
   returns a pointer to the node in the linked list; otherwise it returns
   NULL. The worst-case running time of the search operation is of the order
   of the size of the linked list. */
node *search_value(node *hash_table[], int val)
{
    node *ptr;
    ptr = hash_table[h(val)];
    while ((ptr != NULL) && (ptr->value != val))
        ptr = ptr->next;
    if (ptr != NULL)
        return ptr;
    else
        return NULL;
}

Code to insert a value
/* The element is inserted at the beginning of the linked list whose head
   pointer is stored at the location given by h(k). The running time of the
   insert operation is O(1), as the new key value is always added as the
   first element of the list, irrespective of the size of the linked list as
   well as that of the chained hash table. */
node *insert_value(node *hash_table[], int val)
{
    node *new_node;
    new_node = (node *)malloc(sizeof(node));
    new_node->value = val;
    new_node->next = hash_table[h(val)];
    hash_table[h(val)] = new_node;
    return new_node;
}

Code to delete a value
/* To delete a node from the linked list whose head is stored at the location
   given by h(k) in the hash table, we need to know the address of the node's
   predecessor.
   We do this using a pointer save. The running time complexity of the delete
   operation is the same as that of the search operation, because we need to
   search for the predecessor of the node so that the node can be removed
   without affecting the other nodes in the list. */
void delete_value(node *hash_table[], int val)
{
    node *save, *ptr;
    save = NULL;
    ptr = hash_table[h(val)];
    while ((ptr != NULL) && (ptr->value != val))
    {
        save = ptr;
        ptr = ptr->next;
    }
    if (ptr != NULL)
    {
        if (save != NULL)
            save->next = ptr->next;
        else
            hash_table[h(val)] = ptr->next;   /* the node to delete is the first node of the list */
        free(ptr);
    }
    else
        printf("\n VALUE NOT FOUND");
}

Example
Insert the following four keys 22, 84, 35, 62 into a hash table of size 10 using separate chaining. The hash function is H(key) = key % 10.
1. H(22) = 22 % 10 = 2
2. H(84) = 84 % 10 = 4
3. H(35) = 35 % 10 = 5
4. H(62) = 62 % 10 = 2

Pros and Cons
- The main advantage of using a chained hash table is that it remains effective even when the number of key values to be stored is much higher than the number of locations in the hash table.
- However, with the increase in the number of keys to be stored, the performance of a chained hash table does degrade gradually (linearly). For example, a chained hash table with 1000 memory locations and 10,000 stored keys will give 5 to 10 times less performance as compared to a chained hash table with 10,000 locations. But a chained hash table is still 1000 times faster than a simple hash table.
- The other advantage of using chaining for collision resolution is that its performance, unlike quadratic probing, does not degrade when the table is more than half full. This technique is absolutely free from clustering problems and thus provides an efficient mechanism to handle collisions.
- However, chained hash tables inherit the disadvantages of linked lists. First, to store a key value, the space overhead of the next pointer in each entry can be significant. Second, traversing a linked list has poor cache performance, making the processor cache ineffective.

Advantages
1. More elements can be inserted, using an array of linked lists.
Disadvantages
1. It requires more pointers, which occupy more memory space.
2. Search takes time, since it takes time to evaluate the hash function and also to traverse the list.

OPEN ADDRESSING
Open addressing is a closed hashing collision resolution technique. It uses
    Hi(X) = (Hash(X) + F(i)) mod Tablesize
When a collision occurs, alternative cells are tried until an empty cell is found.
Types:
• Linear Probing
• Quadratic Probing
• Double Hashing
Hash function: H(key) = key % table size.
Insert Operation: To insert a key, use the hash function to identify the list to which the element should be inserted. Then traverse the list to check whether the element is already present. If it exists, increment the count; else the new element is placed at the front of the list.

LINEAR PROBING
This is the easiest method to handle collisions. Apply the hash function H(key) = key % table size, and
    Hi(X) = (Hash(X) + F(i)) mod Tablesize, where F(i) = i.
How to probe:
    first probe  - given a key k, hash to H(key)
    second probe - if H(key) + f(1) is occupied, try H(key) + f(2)
    and so forth.
Probing properties:
    We force f(0) = 0.
    The i-th probe is to (H(key) + f(i)) % table size.
    If i reaches size - 1, the probe has failed.
    Depending on f(i), the probe may fail sooner.
    Long sequences of probes are costly.
The probe sequence is:
    H(key) % table size
    (H(key) + 1) % table size
    (H(key) + 2) % table size
    ...
1. H(key) = key mod Tablesize
   This is the common formula that you should apply for any hashing. If a collision occurs, use Formula 2.
2. H(key) = (H(key) + i) mod Tablesize, where i = 1, 2, 3, … etc.
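The probe sequence given by these formulas can be sketched in C as shown below. This is only an illustrative sketch (the convention of marking empty cells with -1 and the function name insertLinearProbing are assumptions); it reproduces the probe sequence used in the worked example that follows.

#include <stdio.h>
#define TABLE_SIZE 10

int table[TABLE_SIZE] = { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 };

/* Returns the index where key was placed, or -1 if the table is full. */
int insertLinearProbing(int key)
{
    int i, index;
    for (i = 0; i < TABLE_SIZE; i++)                       /* probe at most TABLE_SIZE cells */
    {
        index = (key % TABLE_SIZE + i) % TABLE_SIZE;       /* (H(key) + i) % table size      */
        if (table[index] == -1)                            /* empty cell found               */
        {
            table[index] = key;
            return index;
        }
    }
    return -1;                                             /* no vacant cell: table is full  */
}

int main(void)
{
    int keys[] = {72, 27, 36, 24, 63, 81, 92, 101};
    int n = sizeof(keys) / sizeof(keys[0]);
    for (int k = 0; k < n; k++)
        printf("%d -> index %d\n", keys[k], insertLinearProbing(keys[k]));
    return 0;
}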
Example: Consider a hash table of size 10. Using linear probing, insert the keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table. Let h'(k) = k mod m, m = 10.
Initially, the hash table can be given as:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  -1  -1  -1  -1  -1  -1  -1  -1

Step 1: Key = 72
h(72, 0) = (72 mod 10 + 0) mod 10 = 2
Since T[2] is vacant, insert key 72 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  -1  -1  -1  -1  -1  -1

Step 2: Key = 27
h(27, 0) = (27 mod 10 + 0) mod 10 = 7
Since T[7] is vacant, insert key 27 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  -1  -1  -1  27  -1  -1

Step 3: Key = 36
h(36, 0) = (36 mod 10 + 0) mod 10 = 6
Since T[6] is vacant, insert key 36 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  -1  -1  36  27  -1  -1

Step 4: Key = 24
h(24, 0) = (24 mod 10 + 0) mod 10 = 4
Since T[4] is vacant, insert key 24 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  24  -1  36  27  -1  -1

Step 5: Key = 63
h(63, 0) = (63 mod 10 + 0) mod 10 = 3
Since T[3] is vacant, insert key 63 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  63  24  -1  36  27  -1  -1

Step 6: Key = 81
h(81, 0) = (81 mod 10 + 0) mod 10 = 1
Since T[1] is vacant, insert key 81 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  81  72  63  24  -1  36  27  -1  -1

Step 7: Key = 92
h(92, 0) = (92 mod 10 + 0) mod 10 = 2
Now T[2] is occupied, so we cannot store the key 92 in T[2]. Therefore, try again for the next location, with probe i = 1.
h(92, 1) = (92 mod 10 + 1) mod 10 = 3
Now T[3] is occupied, so we cannot store the key 92 in T[3]. Therefore, try again for the next location, with probe i = 2.
h(92, 2) = (92 mod 10 + 2) mod 10 = 4
Now T[4] is occupied, so we cannot store the key 92 in T[4]. Therefore, try again for the next location, with probe i = 3.
h(92, 3) = (92 mod 10 + 3) mod 10 = 5
Since T[5] is vacant, insert key 92 at this location.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  81  72  63  24  92  36  27  -1  -1

Step 8: Key = 101
h(101, 0) = (101 mod 10 + 0) mod 10 = 1
Now T[1] is occupied, so we cannot store the key 101 in T[1]. Therefore, try again for the next location, with probe i = 1.
h(101, 1) = (101 mod 10 + 1) mod 10 = 2
T[2] is also occupied, so we cannot store the key in this location. The procedure is repeated until the hash function generates the address of location 8, which is vacant and can be used to store key 101.

Pros and Cons
Linear probing finds an empty location by doing a linear search in the array beginning from position h(k). Although the algorithm provides good memory caching through good locality of reference, the drawback of this algorithm is that it results in clustering, and thus there is a higher risk of more collisions where one collision has already taken place.
The performance of linear probing is sensitive to the distribution of input values. As the hash table fills, clusters of consecutive cells are formed and the time required for a search increases with the size of the cluster. In addition to this, when a new value has to be inserted into the table at a position which is already occupied, that value is inserted at the end of the cluster, which again increases the length of the cluster. Generally, an insertion is made between two clusters that are separated by one vacant location.
But with linear probing, there are more chances that subsequent insertions will also end up in one of the clusters, thereby potentially increasing the cluster length by an amount much greater than one. The more collisions there are, the more probes are required to find a free location and the lower the performance. This phenomenon is called primary clustering. To avoid primary clustering, other techniques such as quadratic probing and double hashing are used.

QUADRATIC PROBING
To resolve the primary clustering problem, quadratic probing can be used. With quadratic probing, rather than always moving one spot, we move i^2 spots from the point of collision, where i is the number of attempts made to resolve the collision. It is another collision resolution method which distributes items more evenly. From the original index H, if the slot is filled, try the cells H+1^2, H+2^2, H+3^2, ..., H+i^2, with wrap-around.
    Hi(X) = (Hash(X) + F(i)) mod Tablesize, where F(i) = i^2
    Hi(X) = (Hash(X) + i^2) mod Tablesize

Example: Consider a hash table of size 10. Using quadratic probing, insert the keys 72, 27, 36, 24, 63, 81, and 101 into the table. Take c1 = 1 and c2 = 3.
Solution
Let h'(k) = k mod m, m = 10. We have
    h(k, i) = [h'(k) + c1*i + c2*i^2] mod m
Initially, the hash table can be given as:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  -1  -1  -1  -1  -1  -1  -1  -1

Step 1: Key = 72
h(72, 0) = [72 mod 10 + 1*0 + 3*0] mod 10 = 2 mod 10 = 2
Since T[2] is vacant, insert the key 72 in T[2]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  -1  -1  -1  -1  -1  -1

Step 2: Key = 27
h(27, 0) = [27 mod 10 + 1*0 + 3*0] mod 10 = 7 mod 10 = 7
Since T[7] is vacant, insert the key 27 in T[7]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  -1  -1  -1  27  -1  -1

Step 3: Key = 36
h(36, 0) = [36 mod 10 + 1*0 + 3*0] mod 10 = 6 mod 10 = 6
Since T[6] is vacant, insert the key 36 in T[6]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  -1  -1  36  27  -1  -1

Step 4: Key = 24
h(24, 0) = [24 mod 10 + 1*0 + 3*0] mod 10 = 4 mod 10 = 4
Since T[4] is vacant, insert the key 24 in T[4]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  -1  24  -1  36  27  -1  -1

Step 5: Key = 63
h(63, 0) = [63 mod 10 + 1*0 + 3*0] mod 10 = 3 mod 10 = 3
Since T[3] is vacant, insert the key 63 in T[3]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  -1  72  63  24  -1  36  27  -1  -1

Step 6: Key = 81
h(81, 0) = [81 mod 10 + 1*0 + 3*0] mod 10 = 81 mod 10 = 1
Since T[1] is vacant, insert the key 81 in T[1]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  81  72  63  24  -1  36  27  -1  -1

Step 7: Key = 101
h(101, 0) = [101 mod 10 + 1*0 + 3*0] mod 10 = 1 mod 10 = 1
Since T[1] is already occupied, the key 101 cannot be stored in T[1]. Therefore, try again for the next location, with probe i = 1.
h(101, 1) = [101 mod 10 + 1*1 + 3*1] mod 10 = [1 + 4] mod 10 = 5
Since T[5] is vacant, insert the key 101 in T[5]. The hash table now becomes:
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  81  72  63  24  101 36  27  -1  -1

Pros and Cons
Quadratic probing resolves the primary clustering problem that exists in the linear probing technique. Quadratic probing provides good memory caching because it preserves some locality of reference; but linear probing does this task better and gives better cache performance.
One of the major drawbacks of quadratic probing is that a sequence of successive probes may explore only a fraction of the table, and this fraction may be quite small. If this happens, then we will not be able to find an empty location in the table despite the fact that the table is by no means full. In the example above, try to insert the key 92 and you will encounter this problem.
Although quadratic probing is free from primary clustering, it is still liable to what is known as secondary clustering. It means that if there is a collision between two keys, then the same probe sequence will be followed for both. With quadratic probing, the probability of multiple collisions increases as the table becomes full. This situation is usually encountered when the hash table is more than half full.
Quadratic probing is widely applied in the Berkeley Fast File System to allocate free blocks.
Limitation: at most half of the table can be used as alternative locations to resolve collisions. This means that once the table is more than half full, it is difficult to find an empty spot. This new problem is known as secondary clustering, because elements that hash to the same hash key will always probe the same alternative cells.
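For completeness, the quadratic probe sequence Hi(X) = (Hash(X) + i^2) mod Tablesize can be sketched in C as below. This is only an illustrative sketch (the -1 empty-cell convention and the function name insertQuadraticProbing are assumptions), and it uses F(i) = i^2 rather than the c1, c2 form of the worked example.

#define TABLE_SIZE 10

/* Returns the index where key was placed, or -1 if probing failed. */
int insertQuadraticProbing(int table[], int key)
{
    int i, index;
    for (i = 0; i < TABLE_SIZE; i++)
    {
        index = (key % TABLE_SIZE + i * i) % TABLE_SIZE;   /* (H(key) + i^2) % table size */
        if (table[index] == -1)                            /* vacant cell found           */
        {
            table[index] = key;
            return index;
        }
    }
    return -1;   /* probing failed: as noted above, this can happen even when the table is not full */
}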
DOUBLE HASHING
Double hashing uses the idea of applying a second hash function to the key when a collision occurs. The result of the second hash function gives the number of positions from the point of collision at which to insert.
There are a couple of requirements for the second function:
    It must never evaluate to 0.
    It must make sure that all cells can be probed.
Double hashing can be done using:
    (hash1(key) + i * hash2(key)) % TABLE_SIZE
Here hash1() and hash2() are hash functions and TABLE_SIZE is the size of the hash table. (We repeat by increasing i when a collision occurs.)
The first hash function is typically
    hash1(key) = key % TABLE_SIZE
A popular second hash function is
    hash2(key) = PRIME - (key % PRIME)
where PRIME is a prime smaller than TABLE_SIZE.
Examples 1-3: (see figures)

REHASHING
When the hash table becomes nearly full, the number of collisions increases, thereby degrading the performance of insertion and search operations. In such cases, a better option is to create a new hash table with double the size of the original hash table. All the entries in the original hash table will then have to be moved to the new hash table. This is done by taking each entry, computing its new hash value, and then inserting it in the new hash table. Though rehashing seems to be a simple process, it is quite expensive and must therefore not be done frequently.
Advantages: The programmer does not have to worry about the table size. It is simple to implement, and it can be used in other data structures as well.
The new size of the hash table:
    should also be prime, and
    will be used to calculate the new insertion spot (hence the name rehashing).
This is a very expensive operation! It is O(N), since there are N elements to rehash and the table size is roughly 2N. This is acceptable, though, since it does not happen very often.
The question becomes: when should rehashing be applied? Some possible answers:
    once the table becomes half full
    once an insertion fails
    once a specific load factor has been reached, where the load factor is the ratio of the number of elements in the hash table to the table size

How is Rehashing done?
Rehashing can be done as follows: for each addition of a new entry to the map, check the load factor. If it is greater than its pre-defined value (or the default value of 0.75 if none is given), then rehash. For the rehash, make a new array of double the previous size and make it the new bucket array; then traverse each element in the old bucket array and call insert() for each, so as to insert it into the new, larger bucket array.
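A minimal sketch of this load-factor check and rehashing step, for a simple open-addressed table of integers, is given below. The names initTable, insert, rehash and insertWithRehash, the -1 empty-cell convention, and the 0.75 threshold taken from the paragraph above are illustrative assumptions, not a prescribed implementation (a production version would also move the new size to the next prime, as noted above).

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical global table of int keys; -1 marks an empty cell. */
int *table;
int tableSize;
int numElements;

void initTable(int size)
{
    tableSize = size;
    numElements = 0;
    table = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        table[i] = -1;
}

/* Linear-probing insert into the current table (sketch). */
void insert(int key)
{
    int i, index;
    for (i = 0; i < tableSize; i++) {
        index = (key % tableSize + i) % tableSize;
        if (table[index] == -1) {
            table[index] = key;
            numElements++;
            return;
        }
    }
}

/* Rehash: allocate a table of double the size and re-insert every old key. */
void rehash(void)
{
    int *oldTable = table;
    int oldSize = tableSize;

    tableSize = 2 * oldSize;              /* ideally the next prime >= 2 * oldSize */
    table = malloc(tableSize * sizeof(int));
    for (int i = 0; i < tableSize; i++)
        table[i] = -1;
    numElements = 0;

    for (int i = 0; i < oldSize; i++)     /* traverse the old table and re-insert  */
        if (oldTable[i] != -1)
            insert(oldTable[i]);
    free(oldTable);
}

/* Check the load factor after each insertion and rehash when it exceeds 0.75. */
void insertWithRehash(int key)
{
    insert(key);
    if ((double)numElements / tableSize > 0.75)
        rehash();
}

int main(void)
{
    initTable(5);
    int keys[] = {26, 31, 43, 17, 92, 54};
    for (int i = 0; i < 6; i++)
        insertWithRehash(keys[i]);
    printf("table size after insertions: %d\n", tableSize);   /* prints 10 */
    return 0;
}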
Example: Consider the hash table of size 5 given below. The hash function used is h(x) = x % 5. Rehash the entries into a new hash table.
Index: 0   1   2   3   4
Value: -1  26  31  43  17
Note that the new hash table has 10 locations, double the size of the original table. Now, rehash the key values from the old hash table into the new one using the hash function h(x) = x % 10.
Index: 0   1   2   3   4   5   6   7   8   9
Value: -1  31  -1  43  -1  -1  26  17  -1  -1

EXTENDIBLE HASHING
Extendible hashing is a mechanism for altering the size of the hash table to accommodate new entries when buckets overflow. A common strategy in internal hashing is to double the hash table and rehash each entry. However, this technique is slow, because writing all pages to disk is too expensive. Therefore, instead of doubling the whole hash table, we use a directory of pointers to buckets, and double the number of buckets by doubling the directory, splitting just the bucket that overflows. Since the directory is much smaller than the file, doubling it is much cheaper. Only one page of keys and pointers is split.
Extendible hashing is a dynamic hashing method wherein directories and buckets are used to hash data. It is an aggressively flexible method in which the hash function also experiences dynamic changes.

Main features of Extendible Hashing:
Directories: The directories store the addresses of the buckets in pointers. An id is assigned to each directory, which may change each time Directory Expansion takes place.
Buckets: The buckets are used to hash the actual data.

Basic Structure of Extendible Hashing: (see figure)

Frequently used terms in Extendible Hashing:
Directories: These containers store pointers to buckets. Each directory is given a unique id, which may change each time expansion takes place. The hash function returns this directory id, which is used to navigate to the appropriate bucket. Number of directories = 2^(Global Depth).
Buckets: They store the hashed keys. Directories point to buckets. A bucket may have more than one pointer to it if its local depth is less than the global depth.
Global Depth: It is associated with the directories. It denotes the number of bits which are used by the hash function to categorize the keys. Global Depth = number of bits in the directory id.
Local Depth: It is the same as the Global Depth except that it is associated with the buckets and not the directories. The local depth, in accordance with the global depth, is used to decide the action to be performed when an overflow occurs. The Local Depth is always less than or equal to the Global Depth.
Bucket Splitting: When the number of elements in a bucket exceeds a particular size, the bucket is split into two parts.
Directory Expansion: Directory expansion takes place when a bucket overflows. Directory expansion is performed when the local depth of the overflowing bucket is equal to the global depth.
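The terms above can be summarised in a small C sketch of the data structures involved. The structure and field names below (Bucket, Directory, BUCKET_SIZE, directoryId) are illustrative assumptions only, not part of the prescribed notes.

/* A minimal sketch of the directory/bucket structures described above. */
#define BUCKET_SIZE 3                 /* maximum keys per bucket            */

struct Bucket {
    int localDepth;                   /* bits of the key this bucket covers */
    int count;                        /* number of keys currently stored    */
    int keys[BUCKET_SIZE];            /* the hashed keys                    */
};

struct Directory {
    int globalDepth;                  /* number of bits in a directory id   */
    struct Bucket **buckets;          /* 2^globalDepth pointers to buckets; */
                                      /* several ids may share one bucket   */
};

/* The hash function returns the 'globalDepth' least significant bits of the
   key; this value is the directory id used to locate the bucket. */
int directoryId(int key, int globalDepth)
{
    return key & ((1 << globalDepth) - 1);
}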
Basic Working of Extendible Hashing:
Step 1 - Analyze Data Elements: Data elements may exist in various forms, e.g. integer, string, float, etc. Currently, let us consider data elements of type integer, e.g. 49.
Step 2 - Convert into binary format: Convert the data element into binary form. For string elements, consider the ASCII equivalent integer of the starting character and then convert that integer into binary form. Since we have 49 as our data element, its binary form is 110001.
Step 3 - Check the Global Depth of the directory: Suppose the global depth of the hash directory is 3.
Step 4 - Identify the Directory: Consider the 'Global Depth' number of LSBs in the binary number and match it to the directory id. E.g. the binary obtained is 110001 and the global depth is 3, so the hash function will return the 3 LSBs of 110001, viz. 001.
Step 5 - Navigation: Now, navigate to the bucket pointed to by the directory with directory id 001.
Step 6 - Insertion and Overflow Check: Insert the element and check if the bucket overflows. If an overflow is encountered, go to Step 7 followed by Step 8; otherwise, go to Step 9.
Step 7 - Tackling the Overflow Condition during Data Insertion: Many times, while inserting data into the buckets, it might happen that a bucket overflows. In such cases, we need to follow an appropriate procedure to avoid mishandling of data. First, check whether the local depth is less than or equal to the global depth. Then choose one of the cases below.
    Case 1: If the local depth of the overflowing bucket is equal to the global depth, then Directory Expansion as well as a Bucket Split need to be performed. Then increment the global depth and the local depth value by 1, and assign appropriate pointers. Directory expansion will double the number of directories present in the hash structure.
    Case 2: If the local depth is less than the global depth, then only a Bucket Split takes place. Then increment only the local depth value by 1, and assign appropriate pointers.
Step 8 - Rehashing of Split Bucket Elements: The elements present in the overflowing bucket that is split are rehashed with respect to the new global depth of the directory.
Step 9 - The element is successfully hashed.

Example based on Extendible Hashing:
Now, let us consider a prominent example of hashing the following elements: 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26.
Bucket Size: 3 (assumed)
Hash Function: Suppose the global depth is X. Then the hash function returns the X LSBs.
Solution: First, calculate the binary forms of each of the given numbers:
16 - 10000
4  - 00100
6  - 00110
22 - 10110
24 - 11000
10 - 01010
31 - 11111
7  - 00111
9  - 01001
20 - 10100
26 - 11010
Initially, the global depth and local depth are always 1. Thus, the hashing frame looks like this (see figure).
Inserting 16: The binary format of 16 is 10000 and the global depth is 1. The hash function returns 1 LSB of 10000, which is 0. Hence, 16 is mapped to the directory with id = 0.
Inserting 4 and 6: Both 4 (100) and 6 (110) have 0 as their LSB. Hence, they are hashed to the same bucket.
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by directory 0 is already full. Hence, overflow occurs. As directed by Step 7, Case 1, since Local Depth = Global Depth, the bucket splits and directory expansion takes place. Also, rehashing of the numbers present in the overflowing bucket takes place after the split. And, since the global depth is incremented by 1, the global depth is now 2. Hence, 16, 4, 6, 22 are now rehashed with respect to 2 LSBs [16 (10000), 4 (100), 6 (110), 22 (10110)].
Notice that the bucket which did not overflow has remained untouched. But, since the number of directories has doubled, we now have two directories, 01 and 11, pointing to the same bucket. This is because the local depth of that bucket has remained 1.
Any bucket having a local depth less than the global depth is pointed to by more than one directory.
Inserting 24 and 10: 24 (11000) and 10 (1010) can be hashed based on the directories with ids 00 and 10 respectively. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31 (11111), 7 (111), 9 (1001)] have either 01 or 11 as their LSBs. Hence, they are mapped to the buckets pointed to by 01 and 11. We do not encounter any overflow condition here.
Inserting 20: Insertion of the data element 20 (10100) will again cause an overflow problem, as 20 is inserted into the bucket pointed to by 00. As directed by Step 7, Case 1, since the local depth of the bucket = global depth, directory expansion (doubling) takes place along with bucket splitting. The elements present in the overflowing bucket are rehashed with the new global depth. Now, the new hash table looks like this (see figure).
Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010) are considered. Therefore 26 best fits in the bucket pointed to by directory 010. The bucket overflows, and, as directed by Step 7, Case 2, since the local depth of the bucket < global depth (2 < 3), the directories are not doubled; only the bucket is split and the elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained. Hashing of the 11 numbers is thus completed.

Key Observations:
1. A bucket will have more than one pointer pointing to it if its local depth is less than the global depth.
2. When an overflow condition occurs in a bucket, all the entries in the bucket are rehashed with a new local depth.
3. If the local depth of the overflowing bucket is equal to the global depth, the directory is doubled along with the bucket split; otherwise only the bucket is split.
4. The size of a bucket cannot be changed after the data insertion process begins.

Advantages:
1. Data retrieval is less expensive (in terms of computing).
2. There is no problem of data loss, since the storage capacity increases dynamically.
3. With dynamic changes in the hashing function, associated old values are rehashed with respect to the new hash function.

Limitations of Extendible Hashing:
1. The directory size may increase significantly if several records are hashed to the same directory while the record distribution is non-uniform.
2. The size of every bucket is fixed.
3. Memory is wasted in pointers when the difference between the global depth and local depth becomes drastic.
4. This method is complicated to code.