Advanced Data Structure
By Kayman
21 Jan 2006

Outline
- Review of some data structures
  - Array
  - Linked List
  - Sorted Array
- New stuff: 3 of the most important data structures in OI (and your own programming)
  - Binary Search Tree
  - Heap (Priority Queue)
  - Hash Table

Review
- How to measure the merits of a data structure?
  - Time complexity of common operations
    - Function Find(T : DataType) : Element
    - Function Find_Min() : Element
    - Procedure Add(T : DataType)
    - Procedure Remove(E : Element)
    - Procedure Remove_Min()

Review - Array
- Here Element is simply the integer index of the array cell
- Find(T)
  - Must scan the whole array, O(N)
- Find_Min()
  - Also needs to scan the whole array, O(N)
- Add(T)
  - Simply add it to the end of the array, O(1)
- Remove(E)
  - Deleting an element creates a hole
  - Copy the last element to fill the hole, O(1)
- Remove_Min()
  - Need to Find_Min() then Remove(), O(N)

Review - Linked List
- Element is a pointer to the object
- Find(T)
  - Scan the whole list, O(N)
- Find_Min()
  - Scan the whole list, O(N)
- Add(T)
  - Just add it to a convenient position (e.g. head), O(1)
- Remove(E)
  - With a suitable implementation, O(1)
- Remove_Min()
  - Need to Find_Min() then Remove(), O(N)

Review - Sorted Array
- Like array, Element is the integer index of the cell
- Find(T)
  - We can use binary search, O(logN)
- Find_Min()
  - The first element must be the minimum, O(1)
- Add(T)
  - First we need to find the correct place, O(logN)
  - Then we need to shift the rest of the array by 1 cell, O(N)
- Remove(E)
  - Deleting an element creates a hole
  - Need to shift the rest of the array by 1 cell, O(N)
- Remove_Min()
  - Can be O(1) or O(N) depending on choice of implementation

Review - Summary

              Array    Linked List    Sorted Array
  Find        O(N)     O(N)           O(logN)
  Find_Min    O(N)     O(N)           O(1)
  Add         O(1)     O(1)           O(N)
  Remove      O(1)     O(1)           O(N)
  Remove_Min  O(N)     O(N)           O(1) or O(N)

- If we are going to perform a lot of these operations (e.g. N=100000), none of these is fast enough!

Advanced Data Structure: Binary Search Tree

What is a Binary Search Tree?
- Use a binary tree to store the data
- Maintain this property
  - Left Subtree < Node < Right Subtree
- Example tree:

          11
         /  \
        8    15
       / \     \
      4   9     20

Binary Search Tree - Add
- Add 11, 8, 15, 9, 20, 4 in this order
  - Add 11: 11 becomes the root
  - Add 8: left child of 11
  - Add 15: right child of 11
  - Add 9: right child of 8
  - Add 20: right child of 15
  - Add 4: left child of 8
- The result is the example tree above

Binary Search Tree - Find
- Find 9: visit 11, go left to 8, go right to 9, found
- Find 10: visit 11, go left to 8, go right to 9; 9 has no right child, so 10 is not in the tree

Binary Search Tree - Remove
- Case I : Removing a leaf node
  - Easy
  - Example: Remove 9. The leaf is detached from 8; the rest of the tree is unchanged
- Case II : Removing a node with a single child
  - Replace the removed node with its child
  - Example: Remove 15. Its only child 20 becomes the right child of 11
- Case III : Removing a node with 2 children
  - Replace the removed node with the minimum element in the right subtree (or maximum element in the left subtree)
  - This may create a hole again
  - Apply Case I or II
  - Example: Remove 8. It is replaced by 9, the minimum of its right subtree; 4 stays as the left child
- Sometimes you can avoid this by using “Lazy Deletion”
  - Mark a node as removed instead of actually removing it
  - Less coding, and the performance hit is not big if you are not doing this frequently (it may even save time)
  - Example: Remove 11. The root is marked “del” and stays in place; the rest of the tree is unchanged

Binary Search Tree - Summary
- Add() is similar to Find()
- Find_Min()
  - Just walk to the left, easy
- Remove_Min()
  - Equivalent to Find_Min() then Remove()
- Summary
  - Find() : O(logN)
  - Find_Min() : O(logN)
  - Remove_Min() : O(logN)
  - Add() : O(logN)
  - Remove() : O(logN)
- The BST is “supposed” to behave like that

Binary Search Tree - Problems
- In reality…
  - All these operations are O(logN) only if the tree is balanced
  - Inserting a sorted sequence makes the tree degenerate into a linked list
- The real upper bounds
  - Find() : O(N)
  - Find_Min() : O(N)
  - Remove_Min() : O(N)
  - Add() : O(N)
  - Remove() : O(N)
- Solution
  - AVL Tree, Red-Black Tree
  - Use “rotations” to maintain balance
  - Both are difficult to implement, rarely used

Advanced Data Structure: Heap (Priority Queue)

What is a Heap?
- A (usually) complete binary tree for Priority Queue
  - Enqueue = Add
  - Dequeue = Find_Min and Remove_Min
- Heap Property
  - Every node’s value is less than or equal to those of its descendants, so the minimum is always at the root

Heap - Implementation
- Usually we use an array to simulate a heap
- Assume nodes are indexed 1, 2, 3, ...
  - Parent = floor(Node / 2)
  - Left Child = Node*2
  - Right Child = Node*2 + 1

Heap - Add
- Append the new element at the end
- Shift it up until the heap property is restored

Heap - Remove_Min
- Replace the root with the last element
- Shift it down until the heap property is restored

Heap - Build_Heap
- Apply the shift-down function to the first half of the nodes, from the middle node back to the root

Heap - Summary
- Find() is usually not supported by a heap
  - You may scan the whole tree / array if you really want
- Remove() is equivalent to applying Remove_Min() on a subtree
  - Remember that any subtree of a heap is also a heap
- Summary
  - Find() : O(N) // We usually don’t use Heap for this
  - Find_Min() : O(1)
  - Remove_Min() : O(logN)
  - Add() : O(logN)
  - Remove() : O(logN)

Advanced Data Structure: Hash Table

What is a Hash Table?
- Question
  - We have a Mark Six result (6 integers in the range 1..49)
  - We want to check if our bet matches it
  - What is the most efficient way?
- Answer
  - Use a boolean array with 49 cells
  - Checking a number is O(1)
- Problem
  - What if the range of numbers is very large?
  - What if we need to store strings?
- Solution
  - Use a “Hash Function” to compress the range of values

Hash Table
- Suppose we need to store values between 0 and 99, but only have an array with 10 cells
- We can map the values [0,99] to [0,9] by taking modulo 10. The result is the “Hash Value”
- Adding, finding and removing an element are O(1)
- It is even possible to map strings to integers, e.g. “ATE” to (1*26*26 + 20*26 + 5) mod 10

Hash Table - Collision
- But this approach has an inherent problem
  - What happens if two data items have the same hash value?
- Two major methods to deal with this
  - Chaining (also called Open Hashing)
  - Open Addressing (also called Closed Hashing)

Hash Table - Chaining
- Keep a linked list at each hash table cell
- On average, Add / Find / Remove is O(1+a)
  - a = Load Factor = # of stored elements / # of cells
- If the hash function is “random” enough, we usually get the average case

Hash Table - Open Addressing
- If you don’t want to implement a linked list…
- An alternative is to skip a cell if it is occupied
- The following diagram illustrates “Linear Probing”
  [Diagram: linear probing]
- Find() must continue until a blank cell is reached
- Remove() must use Lazy Deletion, otherwise further operations may fail

Hash Table - Summary
- Find_Min() and Remove_Min() are usually not supported in a Hash Table
  - You may scan the whole table if you really want
- For Chaining
  - Find() : O(1+a)
  - Add() : O(1+a)
  - Remove() : O(1+a)
- For Open Addressing
  - Find() : O(1/(1-a))
  - Add() : O(1/(1-a))
  - Remove() : O(ln(1/(1-a))/a + 1/a)
- Both are close to O(1) if a is kept small (< 50%)

Additional Information
- Judge problems
  - 1020 – Left Join
  - 1021 – Inner Join
  - 1019 – Addition II
  - 1090 – Diligent
- Past contest problems
  - NOI2004 Day 1 – Cashier
- Good place to find related information - Wikipedia
  - http://en.wikipedia.org/wiki/Binary_search_tree
  - http://en.wikipedia.org/wiki/Binary_heap
  - http://en.wikipedia.org/wiki/Hash_table
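
Worked example - Binary Search Tree

The Add, Find and Remove slides for the Binary Search Tree can be sketched in Python roughly as follows. This is only a minimal illustration, not the implementation from the talk: the names `Node`, `add`, `find`, `find_min`, `remove`, `in_order` and `build_example` are assumptions made for this sketch, duplicates are simply ignored, and no balancing is attempted.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def add(root, key):
    """Insert key and return the (possibly new) root of this subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = add(root.left, key)
    elif key > root.key:
        root.right = add(root.right, key)
    return root  # an equal key is ignored in this sketch


def find(root, key):
    """Walk down from the root; return the node holding key, or None."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root


def find_min(root):
    """Keep walking left; the leftmost node holds the minimum."""
    while root.left is not None:
        root = root.left
    return root


def remove(root, key):
    """Remove key and return the new root of this subtree."""
    if root is None:
        return None
    if key < root.key:
        root.left = remove(root.left, key)
    elif key > root.key:
        root.right = remove(root.right, key)
    elif root.left is None:    # Case I (leaf) or Case II (only a right child)
        return root.right
    elif root.right is None:   # Case II (only a left child)
        return root.left
    else:                      # Case III: two children
        successor = find_min(root.right)
        root.key = successor.key
        # Removing the successor from the right subtree is Case I or II.
        root.right = remove(root.right, successor.key)
    return root


def in_order(root):
    """Return the keys in sorted order (left subtree, node, right subtree)."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


def build_example():
    """Build the tree from the slides by adding 11, 8, 15, 9, 20, 4."""
    root = None
    for key in [11, 8, 15, 9, 20, 4]:
        root = add(root, key)
    return root
```

Calling `remove(build_example(), 8)` follows Case III: 8 is overwritten by 9, the minimum of its right subtree, and the old 9 node is then removed as a leaf. Feeding `add` an already sorted sequence produces the degenerate, list-shaped tree described on the Problems slide.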
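
Worked example - Heap

The array-based heap from the Heap slides (nodes indexed from 1, parent at Node / 2, children at Node*2 and Node*2 + 1) might look like the sketch below. The class name `MinHeap` and its method names are assumptions chosen for illustration. It keeps the minimum at the root, and calling `find_min` or `remove_min` on an empty heap is not handled.

```python
class MinHeap:
    def __init__(self, items=()):
        # Index 0 is unused so that node i has parent i // 2
        # and children 2 * i and 2 * i + 1.
        self.a = [None] + list(items)
        # Build_Heap: shift down every node from the middle back to the root.
        for i in range(len(self) // 2, 0, -1):
            self._shift_down(i)

    def __len__(self):
        return len(self.a) - 1

    def find_min(self):
        return self.a[1]

    def add(self, value):
        # Append at the end, then shift up while smaller than the parent.
        self.a.append(value)
        i = len(self)
        while i > 1 and self.a[i] < self.a[i // 2]:
            self.a[i], self.a[i // 2] = self.a[i // 2], self.a[i]
            i //= 2

    def remove_min(self):
        # Move the last element to the root, then shift it down.
        smallest = self.a[1]
        last = self.a.pop()
        if len(self) > 0:
            self.a[1] = last
            self._shift_down(1)
        return smallest

    def _shift_down(self, i):
        n = len(self)
        while 2 * i <= n:
            child = 2 * i
            # Pick the smaller of the two children.
            if child + 1 <= n and self.a[child + 1] < self.a[child]:
                child += 1
            if self.a[i] <= self.a[child]:
                break
            self.a[i], self.a[child] = self.a[child], self.a[i]
            i = child
```

Repeatedly calling `remove_min()` returns the stored values in ascending order, which is the priority queue behaviour (Dequeue = Find_Min and Remove_Min) described on the slides.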
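
Worked example - Hash Table with Open Addressing

The Open Addressing slides say that Find() must continue until a blank cell is reached and that Remove() must use Lazy Deletion. A minimal sketch of linear probing with those two rules is given below. It assumes non-negative integer values hashed by taking the value modulo the number of cells, as in the 10-cell example; the names `LinearProbingSet`, `EMPTY` and `DELETED` are assumptions made for this sketch.

```python
EMPTY = object()     # a cell that has never held a value
DELETED = object()   # lazy-deletion marker left behind by remove()


class LinearProbingSet:
    def __init__(self, cells=10):
        self.cells = [EMPTY] * cells

    def _probe(self, value):
        """Yield cell indices from the hash value onward, wrapping around once."""
        start = value % len(self.cells)
        for step in range(len(self.cells)):
            yield (start + step) % len(self.cells)

    def find(self, value):
        for i in self._probe(value):
            if self.cells[i] is EMPTY:
                return False          # blank cell reached: value is absent
            if self.cells[i] is not DELETED and self.cells[i] == value:
                return True
        return False                  # every cell was probed

    def add(self, value):
        if self.find(value):
            return                    # already stored
        for i in self._probe(value):
            if self.cells[i] is EMPTY or self.cells[i] is DELETED:
                self.cells[i] = value # reuse the first free or deleted cell
                return
        raise OverflowError("hash table is full")

    def remove(self, value):
        for i in self._probe(value):
            if self.cells[i] is EMPTY:
                return                # value was never stored
            if self.cells[i] is not DELETED and self.cells[i] == value:
                self.cells[i] = DELETED   # mark the cell, do not blank it
                return
```

For example, adding 12, 22 and 32 to a 10-cell table places them in cells 2, 3 and 4. If removing 22 blanked cell 3, a later `find(32)` would stop at that blank cell and wrongly report 32 as missing; marking the cell as `DELETED` keeps the probe sequence intact.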