String Matching A Lecture in CE Freshman Seminar Series: Ten Puzzling Problems in Computer Engineering May 2007 String Matching Slide 1 About This Presentation This presentation belongs to the lecture series entitled “Ten Puzzling Problems in Computer Engineering,” devised for a ten-week, one-unit, freshman seminar course by Behrooz Parhami, Professor of Computer Engineering at University of California, Santa Barbara. The material can be used freely in teaching and other educational settings. Unauthorized uses, including any use for financial gain, are prohibited. © Behrooz Parhami May 2007 Edition Released First May 2007 Revised String Matching Revised Slide 2 Word Search Puzzles The puzzle below is a little harder than the normal word search: one of the 36 first/last names has been left out (which one?) Type 1, With Word List Supplied AGITATOR ASSEMBLY CLUTCH CONNECTORS CONTROL COUPLING May 2007 GLIDE LINT SCREEN PULLEY SEAL SWITCH VALVE AMY STEEL KEVIN BLAIR RON PALILLO BARBARA BINGHAM KIRSTEN BAKER SHAVAR ROSS BRUCE MAHLER LARRY ZERNER STU CHARNO String Matching CAROL LACATELL MARK NELSON SUSAN BLU DANA KIMMELL PAUL KRATKA TONY GOLDWYN JOHN FUREY RICHARD YOUNG TRACIE SAVAGE Slide 3 “Ten Puzzling Problems in Computer Engineering” Word Search WORD LIST: BINARY SEARCH BYZANTINE GENERALS CRYPTOGRAPHY EASY HARD IMPOSSIBLE MALFUNCTION DIAGNOSIS PLACEMENT AND ROUTING SATISFIABILITY SORTING NETWORKS STRING MATCHING TASK SCHEDULING V L L O X E C Q B Y X W B Z O R D Q C S K Y E O M O O R S I P W O A Y W K Q I O T O A F G N I H C T A M G N I R T S N R A W S B Y N A Z F H L S W C N Y O O Z T S F Y H L P A K K Z S L F S X N H U P I K H H Q D G B A U S M K A J G W P C J N S L A R E N E G E N I T N A Z Y B L V G C X R J R P H T I Z I Q I J Y A B B E N H E D W S S E D G S S D F C F N I C A E E Z I H K J U A F T N J R D L N B V E T D P M S R F K I Q O T Y J J A C M L L W U X P B A J A H I A P U W R E E V L H O Puzzle generated at: http://puzzlemaker.school.discovery.com/WordSearchSetupForm.html May 2007 String Matching L F O F C B Q T M T Z W Y B M U F J X R I L S N I Z C Y O X J S F Z B D B L Z K N O S L N N B G K V E G Q U X N C R P S G N I T U O R D N A T N E M E C A L P X J T B F D A T Y R C E G C I G D D N B J Y D L T P E M C U J E K X T C K W Q D I Y A E H T B H R Y S Q W S N Y C P F E Y Slide 4 M G Y B B Y H D T W Z J Q Q J L G W Y Q Word Search Puzzle Type 2, With Clues Supplied for the Words to be Found Seven birds Five units of length Four currencies Two things football players wear Large gland in the neck L P I H A W K V E O E N J Z F E A X O S C O J G G W T N O H E R L Y H T Z C R E E A Y E S J S T U R R T L N E X R D O P M M Y Z O R I B O I E J K X D N I U L T R E T E M N N E D O L L A R B D USA Today’s “Word Roundup” for May 16, 2007: http://puzzles.usatoday.com/ 00 12 24 36 LEAGLEUROKRDPOXWYARDRXEOIEOTHYROIDTLHNSNTETPBNEL AJCOZSLMOIMAWZOHCJNMIUNRKFJERSEYELNBVEGRETXZJTED 48 60 May 2007 72 String Matching 84 Slide 5 String Matching: Problem Definition Given a data string with n symbols and a pattern string with m symbols: 1. Does the pattern string appear in the data string? 2. What are the locations of all occurrences of the pattern in the data? The brute-force, or sliding window, algorithm Consider all possible positions where the pattern might begin (n – m + 1) For each start position, do up to m comparison to see if there is a match Worst-case complexity = O(mn); e.g., pattern “aaaaa”, data “aaaaaaaaaa” Pattern string of length m = 5 symbols: EAGLE Data string of length n = 96 symbols 12 24 36 EAGLE EAGLE EAGLE LEAGLEUROKRDPOXWYARDRXEOIEOTHYROIDTLHNSNTETPBNEL AJCOZSLMOIMAWZOHCJNMIUNRKFJERSEYELNBVEGRETXZJTED 00 48 60 May 2007 72 String Matching 84 Slide 6 Converting 2D Search Puzzles to 1D Searches L R I H A W K V A 2D word search puzzle looks more exotic but it can be readily converted to a 1D string search puzzle Row-major order E A E N J Z F E A X O S C O J G G W T N O H E R L Y H T Z C R E E A Y E S J S T U R R T L N E X R D O P M M Y Z O R I B O I E J E X D N I U L T X E T E M N N E LEAGLEUROEXTRAXWYARDRXEOIEOTHYROIDTLHNSNTETPBNEL AJCOZSLMOIMAWZOHCJNMIUNRKFJERSEYELNBVEGRETXZJTED Insert a special symbol (#) between rows to ensure that new words or patterns are not created by the expansion LEAGLEUROEXT#RAXWYARDRXEO#IEOTHYROIDTL#HNSNTETPBNEL# AJCOZSLMOIMA#WZOHCJNMIUNR#KFJERSEYELNB#VEGRETXZJTED# Column-major order May 2007 Similarly for (anti)diagonal String Matching Slide 7 T O L L A R B D Finding a Needle in a Haystack Search for the 10-symbol “needle” h-e-l-e-n- -h-u-n-t in the Internet “haystack” with many TBs of data The brute force algorithm amounts to the following: “Look in this corner, now in this other corner, then over there, and so on.” The Internet holds some 20B pages, each page on average containing in excess of 100 KB m 10 n 20109105 = 21012 B O(mn) 21013 comparisons 20,000 s (> 5 hr), with 109 comparisons/ s May 2007 String Matching Slide 8 Needle in a Haystack: Internet Search Search for the 10-symbol string h-e-l-e-n- -h-u-n-t May 2007 String Matching Slide 9 Needle in a Haystack: Doing Less Work For a particular pattern and unpredictable data strings, preprocess the pattern so that searching for it in various data strings becomes faster Analogy: Magnetize the needle For a particular data string and unpredictable patterns, preprocess the data string so that when a pattern is supplied, we can readily find it with much less work Analogy: Do a thorough search of the haystack for different types of needles and place markers to guide future searches May 2007 String Matching Slide 10 Example of Preprocessing the Pattern String Devise an efficient method for finding the pattern “abcbab” in various data strings formed from the symbols a, b and c b,c 0 a a c 1 b a b 2 a a c 3 a b 4 a b 5 6 c c b,c c b O(n) instead of O(mn) Data string: a b c b b b a b c b a b b c a a b c b a b c b a b c b b 0 1 2 3 4 0 0 1 2 3 4 5 6 0 0 1 1 2 3 4 5 6 3 4 5 6 3 4 0 May 2007 String Matching Slide 11 Example of Preprocessing the Data String Devise an efficient method for finding various patterns in the data string: a b c b b b a b c b a b b c a a b c b a b c b a b c b b 0 1 2 3 a a a b b b b b b c c c a b b a b b b c c a b b b b c b a b c a b a a b 4 5 6 7 8 9 10 11 12 14 10 0, 6, 15, 19, 23 5, 9, 18, 22 4 3 11 12 1, 7, 16, 20, 24 13 8, 17, 21 2, 25 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Find all occurrences of the pattern “abcbab” a b c b b c b a c b a b 0, 6, 15, 19, 23 1, 7, 16, 20, 24 8, 17, 21 5, 9, 18, 22 a b c b b c b a c b a b 0, 6, 15, 19, 23 1, 7, 16, 20, 24 8, 17, 21 5, 9, 18, 22 Alternate strategy: Focus on the locations of a b c and b a b May 2007 String Matching Slide 12 27 Search Engine Indexes May 2007 String Matching Slide 13 Approximate String Matching Notion of string distance Each of the following transformations in a string creates a distance of 1 1. Deletion of a symbol 2. Insertion of an extra symbol 3. Transposition of two adjacent symbols Example distance-1 strings for helen hunt: Example distance-2 strings for helen hunt: hellen hunt Insertion hellen hnut Deletion elen hunt elen huntt helen hnut Transposition lheen hunt Insertion + Transposition Deletion + Insertion 2 Transpositions Wildcard symbols can help in formulating approximate string searches h* hunt means any string that begins with an “h”, ends with “hunt”, and has an arbitrary set of symbols between the two Melvyl (UC library catalog) allows such searches, e.g., author: hunt h* May 2007 String Matching Slide 14