Fundamentals of Python: From First Programs Through Data


Fundamentals of Python: From First Programs Through Data Structures

Chapter 19

Unordered Collections: Sets and Dictionaries

Objectives

After completing this chapter, you will be able to:

• Implement a set type and a dictionary type using lists

• Explain how hashing can help a programmer achieve constant access time to unordered collections

• Explain strategies for resolving collisions during hashing, such as linear probing, quadratic probing, and bucket/chaining

• Use a hashing strategy to implement a set type and a dictionary type

• Use a binary search tree to implement a sorted set type and a sorted dictionary type

Using Sets

• A set is a collection of items in no particular order

• Most typical operations:

– Return the number of items in the set

– Test for the empty set (a set that contains no items)

– Add an item to the set

– Remove an item from the set

– Test for set membership

– Obtain the union of two sets

– Obtain the intersection of two sets

– Obtain the difference of two sets

The Python set Class

A Sample Session with Sets
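
The slide's original session did not survive extraction. As a stand-in, here is a minimal sketch of such a session with Python's built-in set type, exercising the typical operations listed earlier. The variable names are illustrative only.

```python
# A short session with Python's built-in set type.
a = set()                  # create an empty set
a.add(1)
a.add(2)
a.add(2)                   # adding a duplicate has no effect
b = {2, 3}

size = len(a)              # number of items in the set
is_empty = len(a) == 0     # test for the empty set
has_two = 2 in a           # test for set membership

union = a | b              # items in either set
intersection = a & b       # items in both sets
difference = a - b         # items in a that are not in b

a.remove(1)                # remove an item (raises KeyError if absent)
print(size, is_empty, has_two, union, intersection, difference, a)
```

The operators `|`, `&`, and `-` have method equivalents (`union`, `intersection`, `difference`) that return new sets in the same way.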

Applications of Sets

• Sets have many applications in the area of data processing

– Example: In database management, the answer to a query that contains a conjunction of two keys can be constructed from the intersection of the sets of items associated with those keys
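
The database example can be sketched in a few lines. The `index` dictionary, its contents, and the `query_and` helper are hypothetical names invented for illustration, not part of any real database API.

```python
# Hypothetical index: each key maps to the set of record ids associated with it.
index = {
    "python": {1, 2, 5},
    "hashing": {2, 3, 5},
}

def query_and(index, key1, key2):
    """Return the ids of records associated with both keys."""
    # A key that is missing from the index contributes an empty set.
    return index.get(key1, set()) & index.get(key2, set())

matches = query_and(index, "python", "hashing")
print(matches)
```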

Implementations of Sets

• Arrays and lists may be used to contain the data items of a set

– A linked list has the advantage of supporting constant-time removals of items once they are located in the structure

• Hashing attempts to approximate random access into an array for insertions, removals, and searches

Relationship Between Sets and Dictionaries

• A dictionary is an unordered collection of elements called entries

– Each entry consists of a key and an associated value

– A dictionary’s keys must be unique, but its values may be duplicated

• One can think of a dictionary as having a set of keys

List Implementations of Sets and Dictionaries

• The simplest implementations of sets and dictionaries use lists

• This section presents these implementations and assesses their run-time performance

Sets

• List implementation of a set
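
The slide's code listing is missing from this transcript. The following is a minimal sketch of what a list-based set could look like. The class name `ListSet` and its method names are assumptions chosen for illustration, not necessarily those used in the book.

```python
class ListSet:
    """A set backed by a Python list (hypothetical name and interface)."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        return item in self._items      # linear search of the list

    def __iter__(self):
        return iter(self._items)

    def add(self, item):
        if item not in self._items:     # keep the items unique
            self._items.append(item)

    def remove(self, item):
        self._items.remove(item)        # raises ValueError if absent

    def union(self, other):
        return ListSet(list(self) + list(other))

    def intersection(self, other):
        return ListSet(x for x in self if x in other)

    def difference(self, other):
        return ListSet(x for x in self if x not in other)
```

Every method that must locate an item scans the underlying list, which is why this design is easy to write but slow for large sets.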

Dictionaries

• Our list-based implementation of a dictionary is called ListDict

– The entries in a dictionary consist of two parts, a key and a value

• A list implementation of a dictionary behaves in many ways like a list implementation of a set
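
The ListDict listing is not present in this transcript. As a rough sketch under that caveat, a list-based dictionary can store each entry as a two-element list. The `_find` helper and the choice of methods below are assumptions, not the book's exact interface.

```python
class ListDict:
    """A dictionary backed by a list of [key, value] entries (sketch)."""

    def __init__(self):
        self._entries = []

    def _find(self, key):
        """Return the index of the entry with this key, or -1 if absent."""
        for i, entry in enumerate(self._entries):
            if entry[0] == key:
                return i
        return -1

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        return self._find(key) != -1

    def __getitem__(self, key):
        i = self._find(key)
        if i == -1:
            raise KeyError(key)
        return self._entries[i][1]

    def __setitem__(self, key, value):
        i = self._find(key)
        if i == -1:
            self._entries.append([key, value])   # new key
        else:
            self._entries[i][1] = value          # replace the old value

    def pop(self, key):
        i = self._find(key)
        if i == -1:
            raise KeyError(key)
        return self._entries.pop(i)[1]

    def keys(self):
        return [entry[0] for entry in self._entries]
```

As with the list-based set, every access goes through a linear search (`_find`), so the keys behave like a list-based set with a value attached to each item.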

Complexity Analysis of the List Implementations of Sets and Dictionaries

• The list implementations of sets and dictionaries require little programmer effort

– Unfortunately, they do not perform well

• Basic accessing methods must perform a linear search of the underlying list

– Each basic accessing method is O(n)

Hashing Strategies

• Key-to-address transformation or a hashing function

– Acts on a given key by returning its relative position in an array

• Hash table

– An array used with a hashing strategy

• Collision

– Placement of different keys at the same array index
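
To make these three terms concrete, here is a small sketch that assumes integer keys and the simple hashing function `key % capacity`. The function name `home_index` and the sample keys are illustrative choices.

```python
def home_index(key, capacity):
    """Map an integer key to a position in an array of the given capacity."""
    return key % capacity

capacity = 4
table = [None] * capacity          # the hash table
collisions = 0
for key in [3, 4, 8, 10]:
    index = home_index(key, capacity)
    if table[index] is not None:   # a different key already occupies this cell
        collisions += 1
    else:
        table[index] = key

print(table, collisions)
```

Keys 4 and 8 both map to index 0, so the second of them collides and is left unplaced in this sketch. A collision resolution strategy decides where it should go instead.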

The Relationship of Collisions to Density

• Density

– The number of keys relative to the length of an array (also called the load factor)

• As the density decreases, so does the probability of collisions

• Keeping the load factor low (say, below 0.2) seems like a good way to avoid collisions

– However, the cost of memory incurred by load factors below 0.5 is probably prohibitive for data sets of millions of items

– Even load factors below 0.5 cannot prevent many collisions from occurring for some data sets
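
The last point can be checked with a small experiment. This sketch assumes the hashing function `key % capacity`. The key list and capacities are made up to show a data set that collides heavily even at a low density.

```python
def count_collisions(keys, capacity):
    """Count keys whose home index (key % capacity) is already taken."""
    occupied = set()
    collisions = 0
    for key in keys:
        index = key % capacity
        if index in occupied:
            collisions += 1
        else:
            occupied.add(index)
    return collisions

keys = [10, 20, 30, 40, 50]
for capacity in (5, 20, 53):
    density = len(keys) / capacity
    print(capacity, round(density, 2), count_collisions(keys, capacity))
```

With a capacity of 20 the density is only 0.25, yet three of the five keys collide because every key is a multiple of 10. A capacity of 53 spreads the same keys without any collision.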

Fundamentals of Python: From First Programs Through Data Structures 23

Hashing with Non-Numeric Keys

• Try returning the sum of the ASCII values in the string

• This method has the effect of producing the same hash value for anagrams

– Strings that contain the same characters, but in a different order

• First letters of many words in English are unevenly distributed

– This might have the effect of weighting or biasing the sums generated
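
A short sketch shows the anagram problem directly. The function name `ascii_sum` is an illustrative choice.

```python
def ascii_sum(text):
    """Hash a string by summing the character codes of its characters."""
    return sum(ord(ch) for ch in text)

print(ascii_sum("listen"), ascii_sum("silent"))
```

Because addition ignores order, any two anagrams produce the same sum and therefore the same home index in a hash table.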

Fundamentals of Python: From First Programs Through Data Structures 24

Hashing with Non-Numeric Keys

(continued)

• One solution:

– If length of string is greater than a certain threshold

• Drop first character from string before computing sum

• Can also subtract the ASCII value of the last character

• Python also includes a standard hash function for use in hashing applications

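– Function can receive any hashable Python object as an argument and returns an integer (equal objects hash to the same integer, but different objects are not guaranteed to receive different integers)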
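
Both ideas can be sketched as follows. The function `string_hash` and its `threshold` value of 4 are assumptions made for illustration. The slide does not state the actual threshold.

```python
def string_hash(text, threshold=4):
    """Sum character codes, dropping the first character of long strings
    and subtracting the code of the last character.

    The threshold value is an assumption for illustration only.
    """
    if len(text) > threshold:
        text = text[1:]                  # drop the first character
    total = sum(ord(ch) for ch in text)
    if text:
        total -= ord(text[-1])           # subtract the last character
    return total

capacity = 11
index_custom = string_hash("cinema") % capacity
# Built-in hash: the value for strings can differ from one run to the next.
index_builtin = hash("cinema") % capacity
print(string_hash("cinema"), string_hash("iceman"))
```

The anagrams "cinema" and "iceman" now receive different values, although this adjustment does not separate every pair of anagrams. With either function, the result is reduced modulo the table's capacity to obtain an index.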

Fundamentals of Python: From First Programs Through Data Structures 25

Hashing with Non-Numeric Keys

(continued)

Fundamentals of Python: From First Programs Through Data Structures 26

Linear Probing

• Linear probing

– Simplest way to resolve a collision

– Search array, starting from collision spot, for the first available position

• At the start of an insertion, the hashing function is run to compute the home index of the item

– If cell at home index is not available, move index to the right to probe for an available cell

– When search reaches last position of array, probing wraps around to continue from the first position
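
The insertion steps above can be sketched as follows. The sketch assumes a plain Python list as the array, `None` as the empty marker, and small non-negative integer keys (which hash to themselves in CPython). The name `insert_linear` is hypothetical.

```python
EMPTY = None

def insert_linear(table, item):
    """Insert item using linear probing and return the index used."""
    home = hash(item) % len(table)         # home index of the item
    index = home
    while table[index] is not EMPTY:
        index = (index + 1) % len(table)   # move right, wrapping around
        if index == home:                  # every cell has been probed
            raise OverflowError("table is full")
    table[index] = item
    return index

table = [EMPTY] * 5
positions = [insert_linear(table, key) for key in (3, 8, 13, 4)]
print(positions, table)
```

Keys 3, 8, and 13 share home index 3, so 8 lands in cell 4 and 13 wraps around to cell 0. Key 4 then finds its own home cell taken and ends up in cell 1.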

Fundamentals of Python: From First Programs Through Data Structures 27

Linear Probing (continued)

• For retrievals, stop probing process when current array cell is empty or it contains the target item

• For removals, probe as for retrievals

– If target item is found, its cell is set to DELETED
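
A minimal sketch of retrieval and removal under linear probing follows. It assumes `None` for empty cells and a unique sentinel object for DELETED. The function names `find` and `remove` are hypothetical.

```python
EMPTY = None
DELETED = object()          # marks a previously occupied cell

def find(table, item):
    """Return the index of item, or -1 once an EMPTY cell is reached."""
    home = hash(item) % len(table)
    index = home
    while table[index] is not EMPTY:
        # DELETED cells are stepped over instead of ending the search.
        if table[index] is not DELETED and table[index] == item:
            return index
        index = (index + 1) % len(table)
        if index == home:               # every cell has been probed
            break
    return -1

def remove(table, item):
    """Mark the item's cell as DELETED; return True if it was found."""
    index = find(table, item)
    if index == -1:
        return False
    table[index] = DELETED
    return True

# Keys 3, 8, and 13 all have home index 3 in a table of length 5.
table = [13, EMPTY, EMPTY, 3, 8]
remove(table, 8)
found_at = find(table, 13)
print(found_at)
```

If the removed cell were reset to EMPTY instead of DELETED, the search for 13 would stop at cell 4 and wrongly report that 13 is absent.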

Fundamentals of Python: From First Programs Through Data Structures 28

Linear Probing (continued)

• Problem: After several insertions/removals, some items are farther away from their home indexes than they need to be

– Increasing the average overall access time

• Two ways to deal with this problem:

– After a removal, shift items on the cell’s right over to the cell’s left until an empty cell, a currently occupied cell, or the home indexes for each item are reached

– Regularly rehash the table (e.g., when the load factor reaches 0.5)

• Clustering: Occurs when items causing a collision are relocated to the same region within the array

Fundamentals of Python: From First Programs Through Data Structures 29

Linear Probing (continued)

Fundamentals of Python: From First Programs Through Data Structures 30

Quadratic Probing

• To avoid clustering associated with linear probing, we can advance the search for an empty position a considerable distance from the collision point

– Quadratic probing: Increments the home index by the square of a distance on each attempt

• Problem: By jumping over some cells, one or more of them might be missed

– Can lead to some wasted space

Fundamentals of Python: From First Programs Through Data Structures 31

Quadratic Probing (continued)

• Here is the code for insertions, updated to use quadratic probing:
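
The original listing did not survive extraction. The following is a sketch of insertion with quadratic probing, not the book's code. It assumes `None` as the empty marker and small non-negative integer keys, and the name `insert_quadratic` is hypothetical.

```python
EMPTY = None

def insert_quadratic(table, item):
    """Insert item using quadratic probing and return the index used."""
    home = hash(item) % len(table)
    index = home
    distance = 1
    while table[index] is not EMPTY:
        if distance > len(table):
            # The probe sequence repeats after len(table) steps.
            raise OverflowError("no free cell found on the probe sequence")
        index = (home + distance ** 2) % len(table)   # home + 1, + 4, + 9, ...
        distance += 1
    table[index] = item
    return index

table = [EMPTY] * 7
positions = [insert_quadratic(table, key) for key in (3, 10, 17, 24)]
print(positions, table)
```

All four keys share home index 3, and the probes visit cells 3, 4, 0, and 5. The insertion can fail even when free cells remain, which is the wasted space mentioned on the previous slide.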

Fundamentals of Python: From First Programs Through Data Structures 32

Chaining

• Items are stored in an array of linked lists (chains)

– Each item’s key locates the bucket (index) of the chain in which the item resides or is to be inserted

• Retrieval and removal each perform these steps:

– Compute the item’s home index in the array

– Search the linked list at that index for the item

• To insert an item:

– Compute the item’s home index in the array

– If cell is empty, create a node with item and assign the node to cell; else (collision), insert item in chain
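
The steps above can be sketched with a small singly linked node class. The names `Node`, `chain_insert`, `chain_contains`, and `chain_remove` are hypothetical, and new items are placed at the head of the chain as one possible choice.

```python
class Node:
    """A singly linked node holding one item of a chain."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def chain_insert(table, item):
    index = hash(item) % len(table)          # home index (bucket)
    # Covers both cases: an empty cell (None) or an existing chain.
    table[index] = Node(item, table[index])

def chain_contains(table, item):
    node = table[hash(item) % len(table)]
    while node is not None:                  # search the chain at the home index
        if node.data == item:
            return True
        node = node.next
    return False

def chain_remove(table, item):
    index = hash(item) % len(table)
    node, previous = table[index], None
    while node is not None:
        if node.data == item:
            if previous is None:
                table[index] = node.next     # unlink the head of the chain
            else:
                previous.next = node.next    # unlink an interior node
            return True
        previous, node = node, node.next
    return False

table = [None] * 4
for key in (3, 4, 8, 10):
    chain_insert(table, key)
```

Keys 4 and 8 share bucket 0, so that cell holds a chain of two nodes while the other occupied cells hold chains of length 1.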

Fundamentals of Python: From First Programs Through Data Structures 33

Chaining (continued)

Fundamentals of Python: From First Programs Through Data Structures 34

Complexity Analysis

• Linear probing: Complexity depends on load factor (D) and tendency of items to cluster

– Worst case (method traverses entire array before locating item’s position): behavior is linear

– Average behavior in searching for an item that cannot be found is $\frac{1}{2}\left[1 + \frac{1}{(1 - D)^2}\right]$

• Quadratic probing: Tends to mitigate clustering

– Average search complexity is $1 - \log_e(1 - D) - \frac{D}{2}$ for the successful case and $\frac{1}{1 - D} - D - \log_e(1 - D)$ for the unsuccessful case
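
To get a feel for these estimates, the formulas can be evaluated for a few load factors. The function names below are illustrative. The formulas are the ones stated above, with D strictly less than 1.

```python
import math

def linear_unsuccessful(d):
    """Average probes for an unsuccessful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)

def quadratic_successful(d):
    """Average probes for a successful search with quadratic probing."""
    return 1 - math.log(1 - d) - d / 2

def quadratic_unsuccessful(d):
    """Average probes for an unsuccessful search with quadratic probing."""
    return 1 / (1 - d) - d - math.log(1 - d)

for d in (0.25, 0.5, 0.75, 0.9):
    print(d,
          round(linear_unsuccessful(d), 2),
          round(quadratic_successful(d), 2),
          round(quadratic_unsuccessful(d), 2))
```

At D = 0.5 an unsuccessful linear-probing search averages 2.5 probes, but at D = 0.9 the estimate rises to 50.5, which shows how quickly performance degrades as the table fills.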

Fundamentals of Python: From First Programs Through Data Structures 35

Complexity Analysis (continued)

• Chaining:

– Locating an item consists of two parts:

• Computing home index → constant time behavior

• Searching linked list upon a collision → linear

– Worst case (all items that have collided with each other are in one chain, which is a linked list): O(n)

– If lists are evenly distributed in array and array is fairly large, the second part can be close to constant

– Best case (a chain of length 1 occupies each array cell): O(1)

Fundamentals of Python: From First Programs Through Data Structures 36

Case Study: Profiling Hashing Strategies

• Request:

– Write a program that allows a programmer to profile different hashing strategies

• Analysis:

– Should allow the programmer to gather statistics on the number of collisions caused by the hashing strategies

– Other useful information:

• Hash table’s load factor

• Number of probes needed to resolve collisions during linear or quadratic probing

Fundamentals of Python: From First Programs Through Data Structures 37

Case Study: Profiling Hashing

Strategies (continued)

Fundamentals of Python: From First Programs Through Data Structures 38

Case Study: Profiling Hashing

Strategies (continued)

Fundamentals of Python: From First Programs Through Data Structures 39

Case Study: Profiling Hashing

Strategies (continued)

• Analysis (continued):

– Here are the profiler’s results:

• Design:

– Profiler class requires instance variables to track a table, number of collisions, and number of probes

Fundamentals of Python: From First Programs Through Data Structures 40

Case Study: Profiling Hashing

Strategies (continued)

• Implementation:
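
The implementation listing is missing from this transcript. The sketch below is not the book's code. It only illustrates how a class with the three instance variables named in the design (a table, a collision count, and a probe count) might profile linear probing. The method names and the way probes are counted are assumptions.

```python
class Profiler:
    """Gathers collision and probe counts for linear probing (sketch)."""

    def __init__(self):
        self._table = []
        self._collisions = 0
        self._probes = 0

    def test(self, capacity, data):
        """Insert every item of data into a fresh table of this capacity."""
        if len(data) > capacity:
            raise ValueError("data does not fit in the table")
        self._table = [None] * capacity
        self._collisions = 0
        self._probes = 0
        for item in data:
            self._insert(item)

    def _insert(self, item):
        index = hash(item) % len(self._table)
        self._probes += 1                    # visit to the home cell
        if self._table[index] is not None:
            self._collisions += 1            # home cell already taken
        while self._table[index] is not None:
            index = (index + 1) % len(self._table)
            self._probes += 1                # one more cell examined
        self._table[index] = item

    def load_factor(self):
        filled = sum(cell is not None for cell in self._table)
        return filled / len(self._table)

    def collisions(self):
        return self._collisions

    def probes(self):
        return self._probes

profiler = Profiler()
profiler.test(8, [3, 11, 19, 4])
print(profiler.load_factor(), profiler.collisions(), profiler.probes())
```

In this run, keys 3, 11, and 19 share home index 3, and key 4 finds its home cell already taken by 11, giving three collisions and nine probes at a load factor of 0.5.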

Fundamentals of Python: From First Programs Through Data Structures 41

Case Study: Profiling Hashing

Strategies (continued)

Fundamentals of Python: From First Programs Through Data Structures 42

Hashing Implementation of Dictionaries

• HashDict uses the bucket/chaining strategy

– To manage the array, declare three instance variables: _table, _size, and _capacity
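
A minimal sketch of such a class is shown below. It keeps the three instance variables named above, but for brevity each bucket is a Python list of (key, value) pairs where the book uses linked nodes. The method selection and the default capacity are assumptions.

```python
class HashDict:
    """Dictionary using bucket/chaining; each bucket is a Python list here."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._table = [[] for _ in range(capacity)]   # one bucket per cell
        self._size = 0

    def _bucket(self, key):
        return self._table[hash(key) % self._capacity]

    def __len__(self):
        return self._size

    def __contains__(self, key):
        return any(k == key for k, _ in self._bucket(key))

    def __getitem__(self, key):
        for k, value in self._bucket(key):
            if k == key:
                return value
        raise KeyError(key)

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)    # replace the existing entry
                return
        bucket.append((key, value))         # new key
        self._size += 1

    def pop(self, key):
        bucket = self._bucket(key)
        for i, (k, value) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self._size -= 1
                return value
        raise KeyError(key)

    def keys(self):
        return [k for bucket in self._table for k, _ in bucket]

    def values(self):
        return [v for bucket in self._table for _, v in bucket]
```

Each operation first computes the key's bucket and then searches only that bucket, so access stays close to constant time as long as the buckets remain short.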

Fundamentals of Python: From First Programs Through Data Structures 43

Hashing Implementation of Sets

• The design of the methods for HashSet is also the same as the methods in HashDict, except:

– __contains__ searches for an item (not key)

– add inserts item only if it is not already in the set

– A single iterator method is included instead of separate methods that return keys and values
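
The three differences can be seen in a short sketch. As an assumption made for brevity, each bucket is a Python list of items where the book uses linked nodes, and the default capacity is arbitrary.

```python
class HashSet:
    """Set using bucket/chaining; each bucket is a Python list here."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._table = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, item):
        return self._table[hash(item) % self._capacity]

    def __len__(self):
        return self._size

    def __contains__(self, item):
        return item in self._bucket(item)     # searches for an item, not a key

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:                # insert only if not already present
            bucket.append(item)
            self._size += 1

    def remove(self, item):
        self._bucket(item).remove(item)       # raises ValueError if absent
        self._size -= 1

    def __iter__(self):
        # A single iterator over the items replaces keys() and values().
        for bucket in self._table:
            yield from bucket
```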

Fundamentals of Python: From First Programs Through Data Structures 44

Sorted Sets and Dictionaries

• Each item added to a sorted set must be comparable with its other items

– Same applies for keys added to a sorted dictionary

• The iterator for each type of collection guarantees its users access to items or keys in sorted order

• Implementation alternatives:

– List-based: must maintain a sorted list of the items

– Hashing implementation: not feasible

– Binary search tree implementation: generally provides logarithmic access to data items
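
As a rough sketch of the binary search tree alternative, the class below stores items in an unbalanced tree and iterates over them in sorted order. The names `TreeSortedSet` and `_Node` are hypothetical, and removal is omitted to keep the example short.

```python
class _Node:
    def __init__(self, item):
        self.item = item
        self.left = None
        self.right = None

class TreeSortedSet:
    """Sorted set backed by an unbalanced binary search tree (sketch)."""

    def __init__(self):
        self._root = None
        self._size = 0

    def __len__(self):
        return self._size

    def __contains__(self, item):
        node = self._root
        while node is not None:
            if item == node.item:
                return True
            node = node.left if item < node.item else node.right
        return False

    def add(self, item):
        if self._root is None:
            self._root = _Node(item)
            self._size = 1
            return
        node = self._root
        while True:
            if item == node.item:
                return                      # already present
            branch = "left" if item < node.item else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, _Node(item))
                self._size += 1
                return
            node = child

    def __iter__(self):
        def inorder(node):
            # In-order traversal visits the items in sorted order.
            if node is not None:
                yield from inorder(node.left)
                yield node.item
                yield from inorder(node.right)
        return inorder(self._root)
```

Access is logarithmic only while the tree stays reasonably balanced. Adding items in already sorted order produces a chain-shaped tree with linear access, which is why the slide says "generally".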

Fundamentals of Python: From First Programs Through Data Structures 45

Sorted Sets and Dictionaries

(continued)

Fundamentals of Python: From First Programs Through Data Structures 46

Summary

• A set is an unordered collection of items

– Each item is unique

– List-based implementation → linear-time access

– Hashing implementation → constant-time access

• Items in a sorted set can be visited in sorted order

– A tree-based implementation of a sorted set supports logarithmic-time access

• A dictionary is an unordered collection of entries, where each entry consists of a key and a value

– Each key is unique; its values may be duplicated

Fundamentals of Python: From First Programs Through Data Structures 47

Summary (continued)

• A sorted dictionary imposes an ordering by comparison on its keys

• Implementations of both types of dictionaries are similar to those of sets

• Hashing: Technique for locating an item in constant time

– Techniques to resolve collisions: linear probing, quadratic probing, chaining

– The run-time and memory aspects involve the load factor of the array
