Fundamentals of Python: From First Programs Through Data Structures
Chapter 19
Unordered Collections: Sets and Dictionaries
Objectives
After completing this chapter, you will be able to:
• Implement a set type and a dictionary type using lists
• Explain how hashing can help a programmer achieve constant access time to unordered collections
• Explain strategies for resolving collisions during hashing, such as linear probing, quadratic probing, and bucket/chaining
• Use a hashing strategy to implement a set type and a dictionary type
• Use a binary search tree to implement a sorted set type and a sorted dictionary type
Using Sets
• A set is a collection of items in no particular order
• Most typical operations:
– Return the number of items in the set
– Test for the empty set (a set that contains no items)
– Add an item to the set
– Remove an item from the set
– Test for set membership
– Obtain the union of two sets
– Obtain the intersection of two sets
– Obtain the difference of two sets
The Python set Class
A Sample Session with Sets
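A short session of this kind might look like the following sketch. It uses only Python's built-in set type; the variable names and values are illustrative assumptions.

```python
# Illustrative session with Python's built-in set type.
a = set([0, 1, 2])
b = set([2, 3])

print(len(a))             # number of items in a
print(1 in a)             # membership test
a.add(4)                  # add an item
a.remove(0)               # remove an item (raises KeyError if absent)

union = a | b             # items in a or in b
intersection = a & b      # items in both a and b
difference = a - b        # items in a but not in b
print(sorted(union), sorted(intersection), sorted(difference))
```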
Applications of Sets
• Sets have many applications in the area of data processing
– Example: In database management, the answer to a query that contains the conjunction of two keys could be constructed from the intersection of the sets of items associated with those keys
Implementations of Sets
• Arrays and lists may be used to contain the data items of a set
– A linked list has the advantage of supporting constant-time removals of items once they are located in the structure
• Hashing attempts to approximate random access into an array for insertions, removals, and searches
Relationship Between Sets and Dictionaries
• A dictionary is an unordered collection of elements called entries
– Each entry consists of a key and an associated value
– A dictionary’s keys must be unique, but its values may be duplicated
• One can think of a dictionary as having a set of keys
List Implementations of Sets and Dictionaries
• The simplest implementations of sets and dictionaries use lists
• This section presents these implementations and assesses their run-time performance
Sets
• List implementation of a set
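A list-based set could be sketched as follows. This is a minimal illustration under assumptions, not the textbook's listing: the class name ListSet and its method names are hypothetical, chosen to mirror the typical set operations listed earlier.

```python
class ListSet:
    """A set that stores its items in a Python list (illustrative sketch)."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        return item in self._items           # linear search of the list

    def __iter__(self):
        return iter(self._items)

    def add(self, item):
        if item not in self._items:          # keep items unique
            self._items.append(item)

    def remove(self, item):
        self._items.remove(item)             # raises ValueError if absent

    def union(self, other):
        return ListSet(list(self) + list(other))

    def intersection(self, other):
        return ListSet(x for x in self if x in other)

    def difference(self, other):
        return ListSet(x for x in self if x not in other)
```

Every method that looks for an item relies on a linear search of the underlying list.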
Dictionaries
• Our list-based implementation of a dictionary is called ListDict
– The entries in a dictionary consist of two parts, a key and a value
• A list implementation of a dictionary behaves in many ways like a list implementation of a set
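One way to picture a list-based dictionary is the sketch below. Only the name ListDict comes from the slides; the method names and the choice to store each entry as a two-element list are assumptions made for illustration.

```python
class ListDict:
    """A dictionary stored as a list of [key, value] entries (sketch)."""

    def __init__(self):
        self._entries = []

    def _find(self, key):
        """Return the entry for key, or None (linear search)."""
        for entry in self._entries:
            if entry[0] == key:
                return entry
        return None

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        return self._find(key) is not None

    def __getitem__(self, key):
        entry = self._find(key)
        if entry is None:
            raise KeyError(key)
        return entry[1]

    def __setitem__(self, key, value):
        entry = self._find(key)
        if entry is None:
            self._entries.append([key, value])    # new key
        else:
            entry[1] = value                      # replace the existing value

    def pop(self, key):
        entry = self._find(key)
        if entry is None:
            raise KeyError(key)
        self._entries.remove(entry)
        return entry[1]

    def keys(self):
        return [entry[0] for entry in self._entries]
```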
Complexity Analysis of the List Implementations of Sets and Dictionaries
• The list implementations of sets and dictionaries require little programmer effort
– Unfortunately, they do not perform well
• Basic accessing methods must perform a linear search of the underlying list
– Each basic accessing method is O(n)
Hashing Strategies
• Key-to-address transformation or a hashing function
– Acts on a given key by returning its relative position in an array
• Hash table
– An array used with a hashing strategy
• Collision
– Placement of different keys at the same array index
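As a small illustration of these terms, the sketch below uses the remainder operator as a hashing function for integer keys. The function name home_index, the table length, and the keys are assumptions chosen so that exactly one collision occurs.

```python
def home_index(key, capacity):
    """Map an integer key to an index in a table of the given capacity."""
    return key % capacity

capacity = 4
table = [None] * capacity          # the hash table
collisions = 0
for key in [3, 5, 8, 9]:
    index = home_index(key, capacity)
    if table[index] is None:
        table[index] = key
    else:
        collisions += 1            # a different key already occupies this index
print(table, collisions)
```

Here 5 and 9 both map to index 1, so the second of them collides. This sketch only counts the collision and does not store the colliding key.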
The Relationship of Collisions to Density
• Density (load factor)
– The number of keys relative to the length of an array
• As the density decreases, so does the probability of collisions
• Keeping the load factor low (say, below 0.2) seems like a good way to avoid collisions
– The memory cost of load factors below 0.5 is probably prohibitive for data sets of millions of items
– Even load factors below 0.5 cannot prevent many collisions from occurring for some data sets
Hashing with Non-Numeric Keys
• Try returning the sum of the ASCII values in the string
• This method has the effect of producing the same hash values for anagrams
– Strings that contain the same characters, but in a different order
• The first letters of many words in English are unevenly distributed
– This might have the effect of weighting or biasing the sums generated
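The anagram problem is easy to see in a short sketch. The function name string_hash and the sample words are illustrative assumptions.

```python
def string_hash(s):
    """Return the sum of the character codes in s (ASCII values for ASCII text)."""
    return sum(ord(ch) for ch in s)

# Anagrams contain the same characters, so their sums are equal.
print(string_hash("cinema"), string_hash("iceman"))
```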
• One solution:
– If the length of the string is greater than a certain threshold
• Drop the first character from the string before computing the sum
• Can also subtract the ASCII value of the last character
• Python also includes a standard hash function for use in hashing applications
– The function can receive any hashable Python object as an argument and returns an integer; equal objects yield equal values, but distinct objects are not guaranteed distinct values
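The adjustments described above might be sketched as follows. The threshold of 4 and the table length of 11 are assumed values chosen for illustration, and string_hash is a hypothetical name rather than the textbook's function.

```python
def string_hash(s, threshold=4):
    """Sum of character codes with the adjustments described above.

    The default threshold is an assumption made for this sketch.
    """
    if len(s) > threshold:
        s = s[1:]                     # drop the first character
    total = sum(ord(ch) for ch in s)
    if s:
        total -= ord(s[-1])           # subtract the last character's code
    return total

capacity = 11                         # hypothetical table length
print(string_hash("cinema") % capacity, string_hash("iceman") % capacity)

# The built-in hash function accepts any hashable object and returns an int.
print(hash(42) % capacity)
```

With these adjustments the two anagrams no longer produce the same sum, although other pairs of strings can still collide.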
Linear Probing
• Linear probing
– Simplest way to resolve a collision
– Search array, starting from collision spot, for the first available position
• At the start of an insertion, the hashing function is run to compute the home index of the item
– If cell at home index is not available, move index to the right to probe for an available cell
– When search reaches last position of array, probing wraps around to continue from the first position
• For retrievals, stop the probing process when the current array cell is empty or it contains the target item
• For removals, probe in the same way
– If the target item is found, its cell is set to DELETED
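Insertion, retrieval, and removal with linear probing can be sketched as below. The function names, the use of None for an empty cell, and the sentinel object for DELETED are assumptions; the sketch also assumes that None is never stored as an item and does not check for duplicates.

```python
EMPTY = None
DELETED = object()       # sentinel marking a cell whose item was removed

def probe(table, item):
    """Yield indexes from the item's home index onward, wrapping around."""
    home = hash(item) % len(table)
    for step in range(len(table)):
        yield (home + step) % len(table)

def insert(table, item):
    for index in probe(table, item):
        if table[index] is EMPTY or table[index] is DELETED:
            table[index] = item
            return index
    raise OverflowError("table is full")

def find(table, item):
    for index in probe(table, item):
        if table[index] is EMPTY:
            return -1                  # an empty cell ends the search
        if table[index] is not DELETED and table[index] == item:
            return index
    return -1

def remove(table, item):
    index = find(table, item)
    if index != -1:
        table[index] = DELETED         # later probes can still pass this cell
    return index
```

Marking a cell DELETED rather than EMPTY matters: an item that was pushed past this cell by an earlier collision would otherwise become unreachable.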
• Problem: After several insertions and removals, an item may be farther away from its home index than it needs to be
– This increases the average overall access time
• Two ways to deal with this problem:
– After a removal, shift the items on the cell’s right over to the cell’s left until an empty cell, a currently occupied cell, or the home indexes for each item are reached
– Regularly rehash the table (e.g., when the load factor reaches 0.5)
• Clustering: Occurs when the items that cause a collision are relocated to the same region within the array
Quadratic Probing
• To avoid clustering associated with linear probing, we can advance the search for an empty position a considerable distance from the collision point
– Quadratic probing: Increments the home index by the square of a distance on each attempt
• Problem: By jumping over some cells, one or more of them might be missed
– Can lead to some wasted space
• The code for insertions can be updated to use quadratic probing
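The following is a minimal sketch of an insertion with quadratic probing, written for illustration rather than reproduced from the textbook. The function name insert_quadratic, the EMPTY and DELETED markers, and the bound on the number of attempts are assumptions.

```python
EMPTY = None
DELETED = object()       # sentinel marking a cell whose item was removed

def insert_quadratic(table, item):
    """Insert item using quadratic probing; return the index used."""
    home = abs(hash(item)) % len(table)
    index = home
    distance = 1
    # Bound the number of attempts, since quadratic probing can miss cells.
    while distance <= len(table):
        if table[index] is EMPTY or table[index] is DELETED:
            table[index] = item
            return index
        # Probe at home + 1, home + 4, home + 9, ... with wraparound.
        index = (home + distance ** 2) % len(table)
        distance += 1
    raise OverflowError("no available cell found")
```

In a table of length 8, keys with home index 3 only ever reach indexes 3, 4, and 7, so a fourth such key fails even though five cells are free. This is the wasted space noted above.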
Chaining
• Items are stored in an array of linked lists (chains)
– Each item’s key locates the bucket (index) of the chain in which the item resides or is to be inserted
• Retrieval and removal each perform these steps:
– Compute the item’s home index in the array
– Search the linked list at that index for the item
• To insert an item:
– Compute the item’s home index in the array
– If cell is empty, create a node with item and assign the node to cell; else (collision), insert item in chain
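These steps can be sketched with a small singly linked node class. The names Node, chain_insert, chain_contains, and chain_remove are hypothetical, and inserting at the head of the chain is one possible choice among several.

```python
class Node:
    """A singly linked node holding one item."""

    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def chain_insert(table, item):
    index = hash(item) % len(table)
    # The new node goes at the head of the chain; this also covers an empty cell.
    table[index] = Node(item, table[index])

def chain_contains(table, item):
    node = table[hash(item) % len(table)]
    while node is not None:
        if node.data == item:
            return True
        node = node.next
    return False

def chain_remove(table, item):
    index = hash(item) % len(table)
    prev, node = None, table[index]
    while node is not None:
        if node.data == item:
            if prev is None:
                table[index] = node.next      # unlink the head of the chain
            else:
                prev.next = node.next         # unlink an interior node
            return True
        prev, node = node, node.next
    return False
```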
Complexity Analysis
• Linear probing: Complexity depends on the load factor (D) and the tendency of items to cluster
– Worst case (method traverses entire array before locating item’s position): behavior is linear
– Average behavior in searching for an item that cannot be found is $\frac{1}{2}\left[1 + \frac{1}{(1 - D)^2}\right]$
• Quadratic probing: Tends to mitigate clustering
– Average search complexity is $1 - \log_e(1 - D) - \frac{D}{2}$ for the successful case and $\frac{1}{1 - D} - D - \log_e(1 - D)$ for the unsuccessful case
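To get a feel for these expressions, they can be evaluated at a few load factors. The function names and the sample values of D below are assumptions made for illustration.

```python
import math

def linear_unsuccessful(d):
    """Average probes for an unsuccessful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)

def quadratic_successful(d):
    """Average probes for a successful search with quadratic probing."""
    return 1 - math.log(1 - d) - d / 2

def quadratic_unsuccessful(d):
    """Average probes for an unsuccessful search with quadratic probing."""
    return 1 / (1 - d) - d - math.log(1 - d)

for d in (0.25, 0.5, 0.75):
    print(d,
          round(linear_unsuccessful(d), 2),
          round(quadratic_successful(d), 2),
          round(quadratic_unsuccessful(d), 2))
```

All three values grow as D approaches 1, which is why the load factor matters for both strategies.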
• Chaining:
– Locating an item consists of two parts:
• Computing the home index: constant-time behavior
• Searching the linked list upon a collision: linear behavior
– Worst case (all items that have collided with each other are in one chain, which is a linked list): O(n)
– If lists are evenly distributed in the array and the array is fairly large, the second part can be close to constant
– Best case (a chain of length 1 occupies each array cell): O(1)
Case Study: Profiling Hashing Strategies
• Request:
– Write a program that allows a programmer to profile different hashing strategies
• Analysis:
– Should allow the programmer to gather statistics on the number of collisions caused by the hashing strategies
– Other useful information:
• Hash table’s load factor
• Number of probes needed to resolve collisions during linear or quadratic probing
• Design:
– Profiler class requires instance variables to track a table, number of collisions, and number of probes
• Implementation:
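A profiler along these lines might be sketched as follows. This is an illustration under assumptions, not the textbook's implementation: it handles only integer keys and linear probing, and the method names insert, load_factor, and report are hypothetical.

```python
class Profiler:
    """Counts collisions and probes for linear-probing insertions (sketch)."""

    def __init__(self, capacity):
        self._table = [None] * capacity
        self._collisions = 0
        self._probes = 0
        self._size = 0

    def insert(self, key):
        if self._size == len(self._table):
            raise OverflowError("table is full")
        index = key % len(self._table)
        self._probes += 1                        # probe of the home index
        if self._table[index] is not None:
            self._collisions += 1                # home index already occupied
            while self._table[index] is not None:
                index = (index + 1) % len(self._table)
                self._probes += 1                # one probe per cell visited
        self._table[index] = key
        self._size += 1

    def load_factor(self):
        return self._size / len(self._table)

    def report(self):
        """Return (load factor, collisions, probes)."""
        return (self.load_factor(), self._collisions, self._probes)
```

For example, inserting the keys 1, 9, 17, and 2 into a table of length 8 produces three collisions and nine probes at a load factor of 0.5.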
Hashing Implementation of Dictionaries
• HashDict uses the bucket/chaining strategy
– To manage the array, declare three instance variables: _table, _size, and _capacity
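A compact sketch of such a class appears below. The three instance variable names come from the slide; everything else is an assumption, including the default capacity, the method names, and the simplification of using a Python list for each chain instead of a linked list.

```python
class HashDict:
    """Dictionary using an array of chains (bucket/chaining); a sketch."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._table = [[] for _ in range(capacity)]   # one chain per bucket
        self._size = 0

    def _bucket(self, key):
        """Return the chain at the key's home index."""
        return self._table[hash(key) % self._capacity]

    def __len__(self):
        return self._size

    def __contains__(self, key):
        return any(k == key for k, _ in self._bucket(key))

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)     # replace value for an existing key
                return
        bucket.append((key, value))          # new entry joins the chain
        self._size += 1

    def pop(self, key):
        bucket = self._bucket(key)
        for i, (k, v) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self._size -= 1
                return v
        raise KeyError(key)
```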
Hashing Implementation of Sets
• The design of the methods for HashSet is the same as that of the methods in HashDict, except:
– __contains__ searches for an item (not key)
– add inserts item only if it is not already in the set
– A single iterator method is included instead of separate methods that return keys and values
Sorted Sets and Dictionaries
• Each item added to a sorted set must be comparable with its other items
– Same applies for keys added to a sorted dictionary
• The iterator for each type of collection guarantees its users access to items or keys in sorted order
• Implementation alternatives:
– List-based: must maintain a sorted list of the items
– Hashing implementation: not feasible
– Binary search tree implementation: generally provides logarithmic access to data items
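The tree-based alternative can be sketched with an unbalanced binary search tree. The class name TreeSortedSet and the representation of each node as a three-element list are assumptions; because the tree is not rebalanced, access is logarithmic only when insertions leave it reasonably balanced.

```python
class TreeSortedSet:
    """Sorted set backed by an unbalanced binary search tree (sketch)."""

    def __init__(self):
        self._root = None          # each node is a list: [item, left, right]
        self._size = 0

    def __len__(self):
        return self._size

    def add(self, item):
        if self._root is None:
            self._root = [item, None, None]
            self._size += 1
            return
        node = self._root
        while True:
            if item == node[0]:
                return                           # already present
            side = 1 if item < node[0] else 2    # 1 = left, 2 = right
            if node[side] is None:
                node[side] = [item, None, None]
                self._size += 1
                return
            node = node[side]

    def __contains__(self, item):
        node = self._root
        while node is not None:
            if item == node[0]:
                return True
            node = node[1] if item < node[0] else node[2]
        return False

    def __iter__(self):
        # An inorder traversal visits the items in sorted order.
        def inorder(node):
            if node is not None:
                yield from inorder(node[1])
                yield node[0]
                yield from inorder(node[2])
        return inorder(self._root)
```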
Summary
• A set is an unordered collection of items
– Each item is unique
– List-based implementation: linear-time access
– Hashing implementation: constant-time access
• Items in a sorted set can be visited in sorted order
– A tree-based implementation of a sorted set supports logarithmic-time access
• A dictionary is an unordered collection of entries, where each entry consists of a key and a value
– Each key is unique; values may be duplicated
• A sorted dictionary imposes an ordering by comparison on its keys
• Implementations of both types of dictionaries are similar to those of sets
• Hashing: Technique for locating an item in constant time
– Techniques to resolve collisions: linear probing, quadratic probing, chaining
– The run-time and memory aspects involve the load factor of the array