Data Types and Data Structures
• Data Types
  – Containers
  – Dictionaries
  – Priority Queues
• Data Structures
  – Hash Tables
  – Binary Search Trees

Data types & structures
There are numerous data structure options for the most commonly used abstract data types:
• Containers
• Dictionaries
• Priority Queues
Changing data structures should not change the correctness of a program, but it can have a dramatic effect on its speed.

Choosing a Data Structure
It is important to choose the proper data structure when you first design an algorithm. Many data structures can handle the common operations: insertion, deletion, sorting, searching, finding the maximum or minimum, predecessor or successor, etc. Different data structures will each take their own time for the different operations.

Guidelines
• Building an algorithm around a properly chosen data structure leads to both a clean algorithm and good performance.
• Using an incorrect data structure can be disastrous, but you don't always need the best structure.
• Sorting is at the heart of many good algorithms.
• Common algorithm design paradigms include divide-and-conquer, randomization, incremental construction, and dynamic programming.

Fundamental Data Types
An abstract data type is a collection of well-defined operations that can be performed on a particular structure. Different data structures make different tradeoffs, making certain operations (say, insertion) faster at the cost of others (say, searching). Often there will be other considerations that make one structure more desirable than another.

Containers
• Hold data for later retrieval
• Operations:
  – Insert(item)
  – Retrieve(): typically removes the item from the container
• Simple data structures for implementing containers
  – Stack: LIFO
  – Queue: FIFO
  – Table: retrieve by index
• Implementation
  – Linked list or array

Dictionaries
• Dictionaries are a form of container that permits access to data items by content (key).
• Operations:
  – Insert(key)
  – Delete(pointer to item)
  – Search(key)
• Linked list implementation (unsorted)
  – Insert: O(1), prepend at the head
  – Delete: O(1), given a pointer to the item
  – Search: O(n)
• Sorted array implementation
  – Insert: O(n), elements must be shifted to make room
  – Delete: O(n), elements must be shifted to close the gap
  – Search: O(log n) by binary search

Priority Queues
• Insert(x): Given an item x, insert it into the priority queue.
• Find-Maximum(): Return the item with the maximal priority.
• Delete-Maximum(): Remove the item whose key is maximum from the queue.
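To make the priority queue operations concrete, here is a minimal sketch of a binary max-heap in Python. It is an illustration under assumptions, not an implementation prescribed by the slides: the class name MaxPriorityQueue and its method names are invented, and items are assumed to be directly comparable.

    # Minimal max-heap priority queue sketch (illustrative names).
    class MaxPriorityQueue:
        def __init__(self):
            self.heap = []                      # heap[0] is always the maximum

        def insert(self, x):
            """Insert(x): append x, then sift it up to restore the heap order."""
            self.heap.append(x)
            i = len(self.heap) - 1
            while i > 0 and self.heap[(i - 1) // 2] < self.heap[i]:
                parent = (i - 1) // 2
                self.heap[i], self.heap[parent] = self.heap[parent], self.heap[i]
                i = parent

        def find_maximum(self):
            """Find-Maximum(): return (but do not remove) the largest item."""
            return self.heap[0]

        def delete_maximum(self):
            """Delete-Maximum(): remove and return the largest item."""
            top = self.heap[0]
            last = self.heap.pop()
            if self.heap:
                self.heap[0] = last
                self._sift_down(0)
            return top

        def _sift_down(self, i):
            n = len(self.heap)
            while True:
                left, right, largest = 2 * i + 1, 2 * i + 2, i
                if left < n and self.heap[left] > self.heap[largest]:
                    largest = left
                if right < n and self.heap[right] > self.heap[largest]:
                    largest = right
                if largest == i:
                    return
                self.heap[i], self.heap[largest] = self.heap[largest], self.heap[i]
                i = largest

    # Example usage:
    pq = MaxPriorityQueue()
    for key in [5, 1, 9, 3]:
        pq.insert(key)
    print(pq.find_maximum())    # 9
    print(pq.delete_maximum())  # 9
    print(pq.find_maximum())    # 5

With this layout, Insert and Delete-Maximum each walk one root-to-leaf path and take O(log n) time, while Find-Maximum is O(1).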
Data Structures
• Ways to implement data types
  – Linked lists
  – Arrays with auxiliary data
  – Hash tables
  – Binary search trees
  – Others, of course

Hash Tables
• Maintain an array to hold your items
• "Hash" the key to determine the index at which the item should be stored
• Good hash functions
• Universal hash functions
• Methods for dealing with collisions
  – Chaining
  – Open addressing

Direct-address hash table
• Assumptions
  – Universe of keys is small (size m)
  – Set of keys can be mapped to {0, 1, …, m-1}
  – No two elements have the same key
• Use an array of size m
  – Array contents can be a pointer to the element
  – Array can directly store the element

Hash Functions
• Problem with direct-addressed tables
  – Universe of possible keys U is too large
  – Set of keys actually used, K, may be much smaller
• Hash function
  – Use an array of size Θ(m)
  – Use the function h(k) = x to determine slot x
  – h: U → {0, 1, …, m-1}
• Collision
  – When h(k1) = h(k2) for distinct keys k1 and k2

Good Hash Functions
• Each key is equally likely to hash to any of the m slots, independently of where any other key has hashed
• Difficult to achieve, as this requires knowledge of the distribution of keys
• Good characteristics
  – Must be quick to evaluate
  – May want keys that are "close" to map to slots that are far apart

Hashing by Height (figure: heights 1' through 9' mapped to table slots)

Collisions unavoidable
Even if we have a good hash function, we will still have collisions (figure: the twelve months, Jan through Dec, being hashed into a table).

Chaining
• Create a linked list to store all elements that map to the same table slot (a sketch that combines chaining with a universal hash family appears at the end of this section)
• Running time
  – Insert(T,k): how long? what assumptions?
  – Search(T,k): how long?
  – Delete(T,x): given a pointer to element x, how long? what assumptions?

Search time
• Notation
  – n items
  – m slots
  – load factor α = n/m
• Worst-case search time?
  – What is the worst case?
• Expected search time
  – Simple uniform hashing: each element is equally likely to hash to any of the m slots, independent of where any other element has hashed.
  – Expected search time?

Universal hashing
• In the worst case, for any fixed hash function, the keys may be exactly the worst case for that function
• Avoid this by choosing the hash function randomly, independent of the keys to be hashed
• Key distinction from probabilistic analysis
  – A universal hash function will work well with high probability on EVERY input instance, but may perform poorly, with low probability, on any input instance
  – Probabilistic analysis of a static hash function h says h will work well on most input instances every time, but may perform poorly on some input instances every time

Definition and analysis
• Let H be a finite collection of hash functions that map U into {0, …, m-1}
• This collection is universal if for each pair of distinct keys k and q in U, the number of hash functions h in H for which h(k) = h(q) is at most |H|/m.
• If we choose our hash function randomly from H, this implies that there is at most a 1/m chance that h(k) = h(q).
• This leads to the expected length of a chain being n/m
  – Note we assume chaining, not open addressing, in this analysis

An example of universal hash functions
• Choose a prime p larger than all possible keys
• Let Zp = {0, …, p-1} and Zp* = {1, …, p-1}
  – Clearly p > m. Why?
• h_a,b for any a in Zp* and b in Zp
  – h_a,b(k) = ((a·k + b) mod p) mod m
• H_p,m = { h_a,b | a in Zp* and b in Zp }
  – This family has a total of p(p-1) hash functions
• This family of hash functions is universal
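As an illustration of how the H_p,m family above can be used, here is a minimal Python sketch that draws h_a,b at random and resolves collisions by chaining, tying back to the chaining discussion earlier. The prime p = 2^31 - 1, the default table size, and the class and method names are assumptions made for the example; keys are assumed to be non-negative integers smaller than p.

    import random

    class ChainedHashTable:
        """Chaining hash table using a hash function drawn at random from
        the universal family H_p,m sketched above (names illustrative)."""

        P = 2**31 - 1  # a prime assumed to be larger than every key we store

        def __init__(self, m=16):
            self.m = m
            self.a = random.randrange(1, self.P)   # a in Zp* = {1, ..., p-1}
            self.b = random.randrange(0, self.P)   # b in Zp  = {0, ..., p-1}
            self.slots = [[] for _ in range(m)]    # one chain per slot

        def _h(self, key):
            # h_a,b(k) = ((a*k + b) mod p) mod m
            return ((self.a * key + self.b) % self.P) % self.m

        def insert(self, key, value):
            # O(1): append to the chain (assumes the key is not already present)
            self.slots[self._h(key)].append((key, value))

        def search(self, key):
            # Expected O(1 + alpha) under universal hashing, where alpha = n/m
            for k, v in self.slots[self._h(key)]:
                if k == key:
                    return v
            return None

        def delete(self, key):
            # O(chain length) here; with a doubly linked list and a pointer to
            # the element this would be O(1), as in the chaining slide
            chain = self.slots[self._h(key)]
            self.slots[self._h(key)] = [(k, v) for (k, v) in chain if k != key]

Because a and b are chosen at random per table instance, no fixed set of keys can be a guaranteed worst case, which is exactly the point of universal hashing.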
Open addressing
• Store all elements in the table itself
• Probe the hash table in the event of a collision
• Key idea: the probe sequence is NOT the same for each element; it depends on the initial key
• h: U × {0, 1, …, m-1} → {0, 1, …, m-1}
• Permutation requirement
  – h(k,0), h(k,1), …, h(k,m-1) is a permutation of (0, …, m-1)

Operations
• Insert and search are straightforward
• Deletion is awkward: why can we not simply mark the deleted slot as empty?
  – If keys need to be deleted, open addressing may not be the right choice

Probing schemes
• Uniform hashing: each of the m! permutations is equally likely
  – Not typically achieved in practice
• Linear probing: h(k,i) = (h'(k) + i) mod m
  – Clustering effect
  – Only m distinct probe sequences are possible
• Quadratic probing: h(k,i) = (h'(k) + c·i + d·i²) mod m
  – Constraints on c, d, m
  – Better than linear probing, as the clustering effect is not as bad
  – Only m distinct probe sequences are possible, and keys that map to the same initial position have identical probe sequences
• Double hashing: h(k,i) = (h(k) + i·q(k)) mod m
  – q(k) must be relatively prime to m
  – m² probe sequences are possible
  – Much closer to uniform hashing

Search time
• Preliminaries
  – n elements, m slots, load factor α = n/m with n ≤ m
  – Assumption of uniform hashing
• Expected search time on a miss
  – Given that slot h(k,i) is non-empty, what is the probability that slot h(k,i+1) is empty?
  – What is the expected search time then?
• Expected insertion time is essentially the same. Why?
• Expected search time on a hit
  – If the entry was added when the table already held i elements, the expected search time is 1/(1 − i/m) = m/(m − i)
  – Averaging this over all n elements gives (1/α)(H_m − H_{m−n})
  – This can be bounded by (1/α) ln(1/(1 − α))

Binary search trees
• Support search, min, max, predecessor, successor, insert, delete, and listing all elements efficiently
• Thus they can be used for more than just dictionary applications
• Basic tree property: for any node x
  – the left subtree has nodes ≤ x
  – the right subtree has nodes ≥ x

Binary Trees / Example Search Trees (example figures)

Operations
• Search procedure?
  – Search time?
• Minimum node in the tree rooted at node x?
  – Search time?
• Maximum node in the tree rooted at node x?
  – Search time?
• Listing all nodes in sorted order?
  – Time to list?

Successor and Predecessor
Successor: Find the minimal entry in the right subtree, if there is a right subtree. Otherwise, find the first ancestor v such that the entry is in v's left subtree.
Predecessor: Find the maximal entry in the left subtree, if there is a left subtree. Otherwise, find the first ancestor v such that the entry is in v's right subtree.
In either case, if the root node is reached, no predecessor/successor exists.

Simple Insertion and Deletion
Insertion: Traverse the tree as you would when searching. When the required branch does not exist, attach the new entry at that location.
Deletion: Three possible cases exist:
a) Entry is a leaf: just delete it.
b) Entry has one child: remove the entry, replacing it with its child.
c) Entry has two children: replace the entry with its successor. The successor has at most one child (why?); use case a or b on it.

Simple binary search trees
• What is the expected height of a binary search tree?
• Difficult to compute if we allow both insertions and deletions
• With insertions only, the analysis of section 12.4 shows that the expected height is O(log n)
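The following is a minimal, unbalanced binary search tree sketch in Python covering the search, insert, and minimum operations discussed above. The Node class and function names are illustrative; every operation costs O(h), where h is the height of the tree, which is why balancing (taken up next) matters.

    # Unbalanced BST sketch (illustrative names); all operations cost O(h).
    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def search(node, key):
        """Follow the BST property: go left for smaller keys, right for larger."""
        while node is not None and node.key != key:
            node = node.left if key < node.key else node.right
        return node

    def insert(root, key):
        """Walk down as in search; attach a new leaf where the branch is missing."""
        if root is None:
            return Node(key)
        cur = root
        while True:
            if key < cur.key:
                if cur.left is None:
                    cur.left = Node(key)
                    return root
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = Node(key)
                    return root
                cur = cur.right

    def minimum(node):
        """The minimum lives at the end of the leftmost path."""
        while node.left is not None:
            node = node.left
        return node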
Tree-Balancing Algorithms
• Red-Black Trees
• Splay Trees
• Others
  – AVL Trees
  – 2-3 Trees and 2-3-4 Trees
(Both red-black and splay trees rebalance with local rotations; a rotation sketch appears at the end of this section.)

Manipulating Search Trees (figure)

Red-Black Trees
• All nodes in the tree are either red or black.
• Every null child is included and colored black.
• All red nodes must have two black children.
• Every path from the root to a leaf must have the same number of black nodes.
How balanced a tree will this produce? How hard will it be to maintain?

Example Red-Black Tree (figure)

Splay trees
• No adjustment is made in a splay tree when nodes are inserted or removed.
• All rotations occur within the Search function: the element being searched for is rotated to the root of the tree.
• Individual operations may take O(n) time.
• However, it can be shown that any sequence of m operations, including n insertions, starting with an empty tree takes O(m log n) time.

Splay trees
• Dynamic optimality conjecture: splay trees are asymptotically as fast on any sequence of operations as any other type of search tree with rotations.
• What does this mean?
  – Any worst-case sequence of splay tree operations takes amortized O(log n) time per operation.
  – Some sequences of operations take less, e.g., accessing the same ten items over and over again; a splay tree should then also take less time on such sequences.
• One special case that has been proven:
  – Searching in order from the smallest key to the largest key, the total time for all n operations is O(n).

Splay Tree Example (figure)

Specialized Data Structures
• Strings
• Geometric shapes
• Graphs
• Sets
• Schedules
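Referring back to the tree-balancing discussion above: red-black trees and splay trees both restructure the tree using local rotations that preserve the binary-search-tree property. Here is a minimal sketch of the two rotation primitives with an illustrative Node class; it shows only the local restructuring, not the full red-black or splay logic.

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def rotate_left(x):
        """Make x's right child the new subtree root; x becomes its left child."""
        y = x.right
        x.right = y.left      # keys between x and y move under x, preserving order
        y.left = x
        return y              # the caller links y where x used to be

    def rotate_right(y):
        """Mirror image of rotate_left."""
        x = y.left
        y.left = x.right
        x.right = y
        return x

A splay operation is a sequence of these rotations (in zig, zig-zig, and zig-zag patterns) applied along the access path, while red-black insertion and deletion each perform only a constant number of rotations during their fix-up.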