Note to Instructors
This Keynote document contains the slides for “When Words Collide”, Chapter 6 of Explorations in
Computing: An Introduction to Computer Science .
The book invites students to explore ideas in computer science through interactive tutorials where they type expressions in Ruby and immediately see the results, either in a terminal window or a 2D graphics window.
Instructors are strongly encouraged to have a Ruby session running concurrently with Keynote in order to give live demonstrations of the Ruby code shown on the slides.
License
The slides in this Keynote document are based on copyrighted material from Explorations in
Computing: An Introduction to Computer Science , by John S. Conery.
These slides are provided free of charge to instructors who are using the textbook for their courses.
Instructors may alter the slides for use in their own courses, including but not limited to: adding new slides, altering the wording or images found on these slides, or deleting slides.
Instructors may distribute printed copies of the slides, either in hard copy or as electronic copies in
PDF form, provided the copyright notice below is reproduced on the first slide.
© 2012 John S. Conery
✦ Word Lists
✦ Hash Tables
✦ The mod Function Again
✦ Collisions
✦ Hash Table Experiments
Explorations in Computing
© 2012 John S. Conery
✦ When the “spell check” option is turned on an application will indicate misspelled words as soon as you type them
✦ The system uses a wordlist
❖ like a dictionary but it doesn’t have any definitions
❖ simply a list of correctly spelled words
✦ Each time you end a word the application searches the wordlist
??
✦ The list can be very long
❖ a list made from Webster’s 2nd dictionary has almost 235,000 words
✦ The search must be efficient
❖ could be as many as two or three searches every second
✦ Binary search can find a word with at most 18 comparisons
✦ As efficient as binary search is, it is still (theoretically) possible to do a search using fewer comparisons
✦
❖
❖
Suppose there is a special function f
❖ the input to f is any one of n words the output is an integer between 0 and n -1 the function has a very important property: each word maps to a different number
✦ If there is such a function we can use it to store data in an array
❖ save word w at location f ( w )
An array for the 234,936 words from Webster’s 2nd
✦ We can use the array and this special function to implement a spell checker
❖ put every word w at location f ( w )
✦ To see if a string s is a correctly spelled word just look in location f ( s )
❖ if the word there is s we know it is spelled correctly
❖ if we find a different word w then s is not a word in the wordlist
✦ This “search algorithm” requires only one comparison
✦ The function f does not have to order words according to any particular rule
❖ it could order them alphabetically
❖ the words might also be ordered according to their length
❖ the order can even be completely random
✦ All we care about is that this function maps every word to a unique integer
✦ Even though we can’t define a perfect function that assigns each word to a unique location we can come close
✦ A hash function is a function that maps strings to integers
❖ it is possible to define a “perfect” hash function for small sets of words
❖ most practical applications define a function that scatters words randomly throughout an array
❖ use a large array, intentionally leave some locations empty
✦ The phrase “hash function” comes from the way these functions are typically implemented
❖ an input word is chopped into small pieces
❖ each piece is converted to an integer
❖ these numbers are then reassembled to make the location for a word
✦ The characters in strings are based on a numeric code
✦ The most common code: ASCII
❖
❖ pronounced “ass-key” acronym for “American Standard Code for Information Interchange”
✦ Original standard had 128 entries (upper and lower case, digits, punctuation)
❖ modern extended ASCII has 256 symbols (math, accented letters, etc)
From the Wikipedia page for ASCII
✦ Accessing a single element of a string in Ruby returns an ASCII code
>> s = "hello"
=> "hello"
>> s[0]
=> 104
>> s[1]
=> 101
Later we’ll see a method that converts letter codes to numbers between 0 and 25
✦ A table is similar to an array
❖ a technique for organizing a collection of data
❖ access rows of a table according to an index value
✦ The difference: each row in a table can have multiple values
✦ Rows in a table are often called records
Table for storing voter registration information
✦ A hash table is a table where the items are stored using a hash function
❖ choose a column to be the key
❖ key values should be unique
(different in each record)
❖ store records in the row determined by f ( key )
✦ Example: use voter name as the key a table organized by voter name
✦ A common type of table in the “real world” is the index for a book
❖ records are words or phrases and an associated set of page numbers
❖
❖ to find something in the book look it up in the index, then go to the indicated page we don’t scan the book page by page to find information
Go to page 40 to read about creating an array
If the book is a PDF just click the page number....
✦ Google uses a huge index to report results of web searches h (array)
✦ The project for this chapter will be on hash functions
❖
❖ techniques for mapping strings to integers properties of different types of functions
✦ To simplify things we won’t consider tables and how to manage the extra information
❖ we’ll just look at how to organize an “index” and how to resolve collisions
✦ We’ll run some experiments on large wordlists
✦ We can use a Ruby Array object to represent an index
✦ In our other projects, we used Arrays as lists
❖
❖ initialized with a collection of objects
Array grows and shrinks as objects are added or removed
>> a = TestArray.new(5, :colors)
=> ["moccasin", "chocolate", "forest green", "yellow", "navy"]
>> a.length
=> 5
✦ The difference in this project: the Array object will have a fixed size, and some of the rows will be empty
✦ To create an Array with room for a certain number of objects, pass the desired array size in the call to Array.new
>> a = Array.new(10)
=> [nil, nil, nil, ... nil, nil, nil]
>> a.length
=> 10
>> a[1] = "aback"
=> "aback"
>> a[3] = "aardvark"
=> "aardvark"
To add an item to the table, simply use an assignment statement
>> a
=> [nil, "aback", nil, "aardvark", nil, nil, ...]
After adding two words, there are still eight empty rows
h
0
✦ The HashLab module defines a method named ord that converts a letter to a number between 0 and 25
>> s = "hello"
=> "hello"
>> s[0].ord
=> 7
>> s[1].ord
=> 4 ord is short for “ordinal value”
✦ We can use this method to create a trivial hash function
❖ put words starting with “a” in row 0
❖ starting with “b” in row 1
❖ ...
✦ But this function works only if the table has exactly 26 rows
h
1
✦
❖
❖
Here is one way to use the same idea for a larger table
❖ use two letters, so the table has
26 x 26 = 676 rows
❖ imagine the table is a set of blocks each block has 26 rows first block is for words starting with “a”, second for words starting “b”, ...
✦ Use the second letter to find a row within a block
❖
❖
❖
“aardvark” goes in block 0, row 0
“cnidarian” would go in block 2, row 12
“zymurgy” in block 25, row 24 aardvark absolutely ball zany zymurgy words starting with “a” words starting with “b” words starting with “z”
h
1
✦ The addresses of the first row in each block are 0, 26, 52, ... 624
✦ These numbers have a very useful pattern:
❖ block i starts at address 26 * i
✦ The second letter determines how far past the first row we go def h1(s)
(s[0].ord * 26 + s[1].ord) end block number determined by first letter row within the block determined by the second letter words starting with “a” words starting with “b” words starting with “z”
h
1
✦ Example in IRB:
>> h1("aardvark")
=> 0
>> h1("abcissa")
=> 1
>> h1("cnidarian")
=> 65
>> h1("zany")
=> 650
>> h1("zymurgy")
=> 674 def h1(s) (s[0].ord * 26 + s[1].ord) end words starting with “a” words starting with “b” words starting with “z”
✦ This formula can be extended to any number of letters
❖ e.g. for a five-letter word:
✦ This is the same method used to figure out the value of a string of digits in a positional number system
❖ e.g. the number “723” in octal (base 8) is
✦ The function that weights each letter by a power of 26 is called radix-26
✦ The problem with this scheme, of course, is that radix-26 values for long words can be very large numbers
>> radix26("able")
=> 966
>> radix26("answer")
=> 6272049
>> radix26("aardvark")
=> 203723868
>> radix26("ambivalent")
=> 2516677616589
✦ We don’t want to build tables with over 10 12 rows
✦ One solution:
❖
❖ create an array with an arbitrary number of rows n compute the row for a word s by calculating radix26(s) % n
✦ The result is a number between 0 and n -1
❖ exactly what we’re looking for
❖ row numbers in an array of n items range from 0 to n -1
Example: hash function h
0 and a table with n = 10 rows words starting “a”, “k”, or “u” words starting “b”, “l”, or “v” words starting “c”, “m”, or “w”
✦ The HashLab module has an implementation of the hash function h
0
:
>> include HashLab
=> Object
>> h0("apple", 10)
=> 0
>> h0("banana", 10)
=> 1 def h0(s,n) return s[0].ord % n end
The second argument passed to the method is the table size
>> h0("mango", 10)
=> 2
>> h0("strawberry", 10)
=> 8
“m” => 12
12 mod 10 = 2
✦ The module also has implementations of the other hash functions:
>> h0("banana", 1000)
=> 1
The ordinal value of ‘b’ is 1
>> h1("banana", 1000)
=> 26
>> radix26("banana")
=> 12110202
>> hr("banana", 1000)
=> 202
Strings starting ‘ba’ go in row
1 * 26 + 0
The value of x mod 1000 is determined by the last three digits in x
✦ A trivial hash function h0 uses the first letter of a word s[0].ord
❖ range of values = 0..25
✦ A slightly better function h1 uses the first two letters s[0].ord * 26 + h[1].ord
❖ range = 0..675
✦ A general-purpose function hn uses all the letters, e.g.
s[0].ord * 26 7 + s[1].ord * 26 6 + ...
✦ After computing the product of the letter values find the remainder mod n
❖ n is the number of rows in the table
❖ the result is between 0 and n -1
✦ To add a string to a table, simply compute its hash function and save it in the corresponding row
❖ caveat: make sure the row is empty
✦ Finding a string uses the same strategy
❖ compute the hash function
❖ if there is a string in that row, see if it’s the one we’re looking for
❖ if so, return the row number
Both methods must use the same formula for computing a row number...
✦ Testing the insert method:
>> t = Array.new(10)
=> [nil, nil, ... nil, nil]
>> insert("apple", t)
=> 0
>> insert("mango", t)
=> 2
>> insert("cinnamon", t)
=> nil
>> t
=> ["apple", nil, "mango", nil, ... ]
This version of insert will work with any table size, but it always uses h0
The final version of insert will use other hash functions...
✦ Testing the lookup method, assuming strings have been inserted as shown at right:
>> lookup("apple", t)
=> 0
>> lookup("elderberry", t)
=> 4
>> lookup("orange", t)
=> nil
>> lookup("earthworm", t)
=> nil
Note there are two ways the search will return nil : the row is empty, or the row has a different word that hashes to that row number
✦ It is not uncommon for two or more words to have the same hash value
❖ a collision occurs when the insert method tries to place two words in the same row
✦ Example, using the radix-26 function and a table with 100,000 rows:
❖ “fatherhood” inserted in the table at location 25615
❖ “obstructionist” hashes to same location
1951 Sci Fi Thriller
✦ The hash function using radix-26 does a pretty good job of scattering words around in a table
>> fruits.each { |x| puts x; puts hr(x,100)
} apple
70 banana
2 grape
42 kiwi
48
A good hash function assigns words to random locations
✦ Even a perfectly random assignment can’t avoid all collisions
✦ Can’t we just make a bigger table?
✦ That helps, but the odds are still quite high
✦ Probability of a collision when inserting
100 words into a table of size n n
1000
10,000
100,000
1,000,000 p(collision)
99.29%
39.04%
4.83%
0.49%
1,000 empty rows for every full one, but still almost 5% chance of a collision!
✦ A famous paradox from probability theory explains why the probability of a collision is so high
✦ Assume all birthdays are equally likely
✦
✦
In a room with m people, what are the odds that two or more people have the same birthday?
❖ intuition might tell you its around m / 365 m odds of match
❖ that is the right probability for two people at random having the same birthday
10
20
12%
41%
30 70%
But the real question: the probability that every pair in the room have different birthdays
50
100
97%
99.99996%
Search for “birthday paradox” at Wikipedia
✦ The statistics behind the birthday paradox also apply to hash tables
❖ birthdays: 365 choices
❖ hash tables: n choices ( n is the size of the table)
✦ Two people having the same birthday is the same as two words hashing to the same location in a table
✦ To try this at a party:
❖ start with a blank calendar (an empty table)
❖
❖ as you ask people their birthdays put a check mark next to a day on the calendar (that row in the table has been filled) if someone’s birthday falls on a day that has been checked already then you have found a collision
✦ One strategy for dealing with collisions is to search for a nearby open row
✦ A different strategy is to store a list of words in the array
❖ these lists are known as buckets
✦ When the table is first created every row will be empty
✦ The first time a word is placed in a row the insert method needs to initialize a new array
✦ New items are just appended to the end of the array def insert (s, t) i = h(s, t.length) t[i] = [] if t[i].nil?
t[i] << s return s end
✦ Finding a string s in the table is also very easy
❖ apply the hash function to get the row for s
❖ if the row is empty s is not in the table
❖ otherwise look for s in the bucket at that row def lookup (s, t) i = h(s, t.length) return nil if t[i].nil?
return i if t[i].include?(s) return nil end
A method in the Array class
Does a linear search through t[i] (an array object)
✦ The HashLab module defines a new type of object, called a HashTable
>> include HashLab
=> Object
>> t = HashTable.new(100)
=> #<RubyLabs::HashLab::HashTable: 100 rows, :hr>
✦ A method named insert will add a string to the table:
>> TestArray.new(5, :cars).each { |x| t.insert(x) }
=> ["mercury", "bmw", "jaguar", "vauxhall", "pontiac"]
✦ Call lookup to see if a string is in the table:
>> t.lookup("bmw")
=> 10
✦ This expression gets a list of five car names and adds them to the table:
>> TestArray.new(5, :cars).each { |x| t.insert(x) }
=> ["mercury", "bmw", "jaguar", "vauxhall", "pontiac"]
✦ The print_table methods prints all the non-empty buckets:
>> print_table(t)
10: ["bmw"]
46: ["mercury", "pontiac"]
77: ["jaguar"]
93: ["vauxhall"]
=> nil
How many collisions occurred when this table was built?
✦ When you installed Ruby and the RubyLabs gem on your computer you also installed software called “Tk”
✦ Tk is a set of software modules that creates windows, menus, buttons, and other user interface “widgets”
✦ To see if RubyLabs is set up to use Tk:
>> hello
Hello, you have successfully installed RubyLabs!
=> true
Output from RubyLabs 0.9.4
✦ After you make a HashTable object, you can ask IRB to draw it on the
RubyLabs Canvas:
>> view_table(t)
=> true the view_table method will initialize a new drawing on the RubyLabs Canvas
✦ Now each time you call the object’s insert method you will see the item in the drawing:
>> TestArray.new(5,:fish).each { |x| t.insert(x) }
=> ["icefish", "herring", "anchovy", "brook trout", "chinook"]
(output shown on next slide)
Live Demo
Question: how many strings are in row 7 of the table?
✦ Note that lookup finds a string even if it’s not the first one in a bucket:
>> t.lookup("mercury")
=> 46
>> t.lookup("pontiac")
=> 46
✦ lookup returns nil if the bucket is empty or if a string isn’t in the bucket for its row:
>> t.lookup("opel")
=> nil
>> hr("oldsmobile", 100)
=> 46
>> t.lookup("oldsmobile")
=> nil
✦ Make a small table that uses the h
0 hash function, check to see that strings are placed where expected
>> t = HashTable.new(10, :h0) Pass a hash function name ( :h0 , :h1 , or :hn ) when creating the table
>> TestArray.new(10, :fruits).each { |s| t.insert(s) }
>> print_table(t)
1: ["lemon"]
2: ["cloudberry"]
3: ["date"]
5: ["fig", "pear"]
Are these strings where they should be?
6: ["grapefruit", "grape", "gooseberry"]
8: ["strawberry"]
9: ["tangerine"]
=> nil
✦ A useful feature of the TestArray class -- pass :all instead of a number to get a complete list of strings
>> t = HashTable.new(100)
=> #<RubyLabs::HashLab::HashTable: 100 rows, :hr>
>> TestArray.new(:all, :cars).each { |s| t.insert(s) }
=> ["acura", "alfa romeo", ... "vauxhall", "volkswagen"]
✦ The print_stats method prints information about the structure of the table:
>> t.print_stats
shortest bucket: 0
There are 50 car names longest bucket: 3 empty buckets: 63 mean bucket length: 1.35
Ideally, there would be 50 buckets with 1 name each, and 50 empty buckets
✦ Let’s do an experiment with all the words in the RubyLabs word list
>> words = TestArray.new(:all, :words); puts "done" done
>> words.length
=> 210653
We used this “trick” in an earlier experiment -- it keeps IRB from printing the entire word list
✦ If you’re curious, print some of the words:
>> words[0..9]
=> ["aa", "aal", "aalii", "aam", "aardvark", ... "abacate"]
>> words[-10..-1]
=> ["zymotechnical", ... "zymurgy", "zythem", "zythum"]
✦ Make a table with 100,000 rows and add all the words:
>> t = HashTable.new(100000)
>> words.each { |s| t.insert(s) }; puts "done"
>> t.print_stats
shortest bucket: 0 longest bucket: 15 empty buckets: 18298 mean bucket length: 2.58
=> nil
With ~200,000 words and 100,000 rows we would like to see 2 words/row
18% of the rows are empty
There is one fairly large bucket
✦ Repeat the experiment, but make a table with 500,000 rows
>> t = HashTable.new(500000)
>> words.each { |s| t.insert(s) }; puts "done"
>> t.print_stats
shortest bucket: 0 longest bucket: 7 empty buckets: 336216 mean bucket length: 1.29
=> nil
The longest bucket now has 7 words instead of 15
Mean bucket length is low -- most nonempty buckets probably have one word
But 2/3 of the rows are empty
✦ Advanced projects at the end of Section 6.5 (marked with ◆ ) explore the effect of table size
Table Size
500,000
Mean
Bucket
Length
1.29
Longest
Bucket
7
456,976 12.69
6,932
Challenge:
Explain why the table with 456,976 rows has such awful performance
See the chapter for more information
456,959 1.26
7
Note: 456,976 = 26 4 , and 456,959 is a prime number...
✦ A hash table is a technique for organizing information
❖ analogous to an index in a book
✦ Use a hash function to determine a row in the table
❖ insert a word s in row i = h ( s )
❖ to see if s is in the table compute i = h ( s ), look in row i
✦ Trivial hash function: use only the first letter of a word
✦ Better hash functions: use more letters
✦ Collisions occur when h ( s
1
) = h ( s
2
)
❖ far more frequent than we might guess (birthday paradox)
✦ A table can deal with collisions by using buckets
✦ As an example of how these ideas are used in practice let’s look at how
Google does its searches
✦ Google uses a “web crawler” to go out on the internet and fetch pages
❖ their goal is to get a copy of every page on the internet
✦ Issues:
❖
❖ don’t get dynamic pages (e.g. from on-line games) respect privacy: some sites don’t want to be searched
‣ example: course web site with solutions to problem sets
‣ book publishers ask faculty members to protect the pages containing solutions
✦ Once the pages are collected organize them so searches are fast
❖
❖ main index or “forward index” links a page ID to its location in the system a “reverse index” is an index for every word found on a page
❖ connects words to pages that contain them
✦ When processing a page to build the reverse index, consider the context
❖ words that occur in headlines (e.g. page titles) and section headers are given a higher weight than words appearing in the page body
❖ extract words appearing in links
✦ What made Google different: rank pages by “importance”
❖ an important page is one that others refer to
✦
❖
❖
Example -search for “ducks” (ca. 2008):
❖
❖ goducks.com (UO Athletic Department) wikipedia.org/wiki/Duck (Wikipedia page for “duck”) www.ducks.org (Ducks Unlimited, a conservation organization) ducks.nhl.com (Anaheim Mighty Ducks, a pro hockey team)
✦ From a 1998 paper:
“After each document is parsed... every word is converted into a wordID by using an inmemory hash table”
The Anatomy of a Large-Scale Hypertextual Web Search Engine,
Sergey Brin and Lawrence Page, Computer Networks 30(1-7): 107-117 (1998)
(search “google.pdf” to get a copy from a Stanford publication library)
✦
❖
❖
Some statistics from that era:
❖ the search engine used a “lexicon” of 14,000,000 words
❖ extracted from 24,000,000 pages
❖ pages took up 147GB word index required 37GB word list (lexicon) was 293MB
✦ Is a lexicon of 14,000,000 entries big enough?
❖ how many words does the average person use? know?
✦ Depending on how you define “word” the English language has between
100,000 and 250,000 words
❖
❖
❖ is “hot dog” a word?
how does punctuation work? is “pseudocode” the same as “pseudo-code”?
do you include place names? people? species names? ( “ Yersinia pestis ” ?)