String Matching A Lecture in CE Freshman Seminar Series: May 2007

advertisement
String Matching
A Lecture in CE Freshman Seminar Series:
Ten Puzzling Problems in Computer Engineering
May 2007
String Matching
Slide 1
About This Presentation
This presentation belongs to the lecture series entitled
“Ten Puzzling Problems in Computer Engineering,”
devised for a ten-week, one-unit, freshman seminar
course by Behrooz Parhami, Professor of Computer
Engineering at University of California, Santa Barbara.
The material can be used freely in teaching and other
educational settings. Unauthorized uses, including any
use for financial gain, are prohibited. © Behrooz Parhami
May 2007
Edition
Released
First
May 2007
Revised
String Matching
Revised
Slide 2
Word Search Puzzles
The puzzle below is a little harder than the normal word search:
one of the 36 first/last names has been left out (which one?)
Type 1, With Word List Supplied
AGITATOR
ASSEMBLY
CLUTCH
CONNECTORS
CONTROL
COUPLING
May 2007
GLIDE
LINT SCREEN
PULLEY
SEAL
SWITCH
VALVE
AMY STEEL
KEVIN BLAIR
RON PALILLO
BARBARA BINGHAM
KIRSTEN BAKER
SHAVAR ROSS
BRUCE MAHLER
LARRY ZERNER
STU CHARNO
String Matching
CAROL LACATELL
MARK NELSON
SUSAN BLU
DANA KIMMELL
PAUL KRATKA
TONY GOLDWYN
JOHN FUREY
RICHARD YOUNG
TRACIE SAVAGE
Slide 3
“Ten Puzzling
Problems in
Computer
Engineering”
Word Search
WORD LIST:
BINARY SEARCH
BYZANTINE GENERALS
CRYPTOGRAPHY
EASY HARD IMPOSSIBLE
MALFUNCTION DIAGNOSIS
PLACEMENT AND ROUTING
SATISFIABILITY
SORTING NETWORKS
STRING MATCHING
TASK SCHEDULING
V
L
L
O
X
E
C
Q
B
Y
X
W
B
Z
O
R
D
Q
C
S
K
Y
E
O
M
O
O
R
S
I
P
W
O
A
Y
W
K
Q
I
O
T
O
A
F
G
N
I
H
C
T
A
M
G
N
I
R
T
S
N
R
A
W
S
B
Y
N
A
Z
F
H
L
S
W
C
N
Y
O
O
Z
T
S
F
Y
H
L
P
A
K
K
Z
S
L
F
S
X
N
H
U
P
I
K
H
H
Q
D
G
B
A
U
S
M
K
A
J
G
W
P
C
J
N
S
L
A
R
E
N
E
G
E
N
I
T
N
A
Z
Y
B
L
V
G
C
X
R
J
R
P
H
T
I
Z
I
Q
I
J
Y
A
B
B
E
N
H
E
D
W
S
S
E
D
G
S
S
D
F
C
F
N
I
C
A
E
E
Z
I
H
K
J
U
A
F
T
N
J
R
D
L
N
B
V
E
T
D
P
M
S
R
F
K
I
Q
O
T
Y
J
J
A
C
M
L
L
W
U
X
P
B
A
J
A
H
I
A
P
U
W
R
E
E
V
L
H
O
Puzzle generated at:
http://puzzlemaker.school.discovery.com/WordSearchSetupForm.html
May 2007
String Matching
L
F
O
F
C
B
Q
T
M
T
Z
W
Y
B
M
U
F
J
X
R
I
L
S
N
I
Z
C
Y
O
X
J
S
F
Z
B
D
B
L
Z
K
N
O
S
L
N
N
B
G
K
V
E
G
Q
U
X
N
C
R
P
S
G
N
I
T
U
O
R
D
N
A
T
N
E
M
E
C
A
L
P
X
J
T
B
F
D
A
T
Y
R
C
E
G
C
I
G
D
D
N
B
J
Y
D
L
T
P
E
M
C
U
J
E
K
X
T
C
K
W
Q
D
I
Y
A
E
H
T
B
H
R
Y
S
Q
W
S
N
Y
C
P
F
E
Y
Slide 4
M
G
Y
B
B
Y
H
D
T
W
Z
J
Q
Q
J
L
G
W
Y
Q
Word Search Puzzle
Type 2, With Clues Supplied for the Words to be Found
Seven birds

Five units of length

Four currencies

Two things football players wear

Large gland in the neck

L
P
I
H
A
W
K
V
E
O
E
N
J
Z
F
E
A
X
O
S
C
O
J
G
G
W
T
N
O
H
E
R
L
Y
H
T
Z
C
R
E
E
A
Y
E
S
J
S
T
U
R
R
T
L
N
E
X
R
D
O
P
M
M
Y
Z
O
R
I
B
O
I
E
J
K
X
D
N
I
U
L
T
R
E
T
E
M
N
N
E
D
O
L
L
A
R
B
D
USA Today’s “Word Roundup” for May 16, 2007: http://puzzles.usatoday.com/
00
12
24
36
LEAGLEUROKRDPOXWYARDRXEOIEOTHYROIDTLHNSNTETPBNEL
AJCOZSLMOIMAWZOHCJNMIUNRKFJERSEYELNBVEGRETXZJTED
48
60
May 2007
72
String Matching
84
Slide 5
String Matching: Problem Definition
Given a data string with n symbols and a pattern string with m symbols:
1. Does the pattern string appear in the data string?
2. What are the locations of all occurrences of the pattern in the data?
The brute-force, or sliding window, algorithm
Consider all possible positions where the pattern might begin (n – m + 1)
For each start position, do up to m comparison to see if there is a match
Worst-case complexity = O(mn); e.g., pattern “aaaaa”, data “aaaaaaaaaa”
Pattern string of length m = 5 symbols: EAGLE
Data string of length n = 96 symbols
12
24
36
EAGLE
EAGLE
EAGLE
LEAGLEUROKRDPOXWYARDRXEOIEOTHYROIDTLHNSNTETPBNEL
AJCOZSLMOIMAWZOHCJNMIUNRKFJERSEYELNBVEGRETXZJTED
00
48
60
May 2007
72
String Matching
84
Slide 6
Converting 2D Search Puzzles to 1D Searches
L
R
I
H
A
W
K
V
A 2D word search puzzle looks more
exotic but it can be readily converted
to a 1D string search puzzle
Row-major order
E
A
E
N
J
Z
F
E
A
X
O
S
C
O
J
G
G
W
T
N
O
H
E
R
L
Y
H
T
Z
C
R
E
E
A
Y
E
S
J
S
T
U
R
R
T
L
N
E
X
R
D
O
P
M
M
Y
Z
O
R
I
B
O
I
E
J
E
X
D
N
I
U
L
T
X
E
T
E
M
N
N
E
LEAGLEUROEXTRAXWYARDRXEOIEOTHYROIDTLHNSNTETPBNEL
AJCOZSLMOIMAWZOHCJNMIUNRKFJERSEYELNBVEGRETXZJTED
Insert a special symbol (#) between rows to ensure that new words or
patterns are not created by the expansion
LEAGLEUROEXT#RAXWYARDRXEO#IEOTHYROIDTL#HNSNTETPBNEL#
AJCOZSLMOIMA#WZOHCJNMIUNR#KFJERSEYELNB#VEGRETXZJTED#
Column-major order
May 2007
Similarly for (anti)diagonal
String Matching
Slide 7
T
O
L
L
A
R
B
D
Finding a Needle in a Haystack
Search for the 10-symbol “needle”
h-e-l-e-n- -h-u-n-t in the Internet
“haystack” with many TBs of data
The brute force algorithm
amounts to the following:
“Look in this corner, now in
this other corner, then over
there, and so on.”
The Internet holds some
20B pages, each page
on average containing
in excess of 100 KB
m  10
n  20109105 = 21012 B
O(mn)  21013 comparisons  20,000 s (> 5 hr), with 109 comparisons/ s
May 2007
String Matching
Slide 8
Needle in a Haystack: Internet Search
Search for the 10-symbol
string h-e-l-e-n- -h-u-n-t
May 2007
String Matching
Slide 9
Needle in a Haystack: Doing Less Work
For a particular pattern and
unpredictable data strings,
preprocess the pattern so that
searching for it in various data
strings becomes faster
Analogy: Magnetize the needle
For a particular data string
and unpredictable patterns,
preprocess the data string
so that when a pattern is
supplied, we can readily
find it with much less work
Analogy: Do a thorough search
of the haystack for different types of needles
and place markers to guide future searches
May 2007
String Matching
Slide 10
Example of Preprocessing the Pattern String
Devise an efficient method for finding the pattern “abcbab” in various
data strings formed from the symbols a, b and c
b,c
0
a
a
c
1
b
a
b
2
a
a
c
3
a
b
4
a
b
5
6
c
c
b,c
c
b
O(n) instead of O(mn)
Data string:
a b c b b b a b c b a b b c a a b c b a b c b a b c b b
0 1 2 3 4 0 0 1 2 3 4 5 6 0 0 1 1 2 3 4 5 6 3 4 5 6 3 4 0
May 2007
String Matching
Slide 11
Example of Preprocessing the Data String
Devise an efficient method for finding various patterns in the data string:
a b c b b b a b c b a b b c a a b c b a b c b a b c b b
0
1
2
3
a
a
a
b
b
b
b
b
b
c
c
c
a
b
b
a
b
b
b
c
c
a
b
b
b
b
c
b
a
b
c
a
b
a
a
b
4
5
6
7
8
9
10
11
12
14
10
0, 6, 15, 19, 23
5, 9, 18, 22
4
3
11
12
1, 7, 16, 20, 24
13
8, 17, 21
2, 25
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Find all occurrences of the
pattern “abcbab”
a
b
c
b
b
c
b
a
c
b
a
b
0, 6, 15, 19, 23
1, 7, 16, 20, 24
8, 17, 21
5, 9, 18, 22
a
b
c
b
b
c
b
a
c
b
a
b
0, 6, 15, 19, 23
1, 7, 16, 20, 24
8, 17, 21
5, 9, 18, 22
Alternate strategy: Focus on the locations of a b c and b a b
May 2007
String Matching
Slide 12
27
Search Engine Indexes
May 2007
String Matching
Slide 13
Approximate String Matching
Notion of string distance
Each of the following transformations in a string creates a distance of 1
1. Deletion of a symbol
2. Insertion of an extra symbol
3. Transposition of two adjacent symbols
Example distance-1 strings
for helen hunt:
Example distance-2 strings
for helen hunt:
hellen hunt Insertion
hellen hnut
Deletion
elen hunt
elen huntt
helen hnut Transposition lheen hunt
Insertion + Transposition
Deletion + Insertion
2 Transpositions
Wildcard symbols can help in formulating approximate string searches
h* hunt means any string that begins with an “h”, ends with “hunt”, and
has an arbitrary set of symbols between the two
Melvyl (UC library catalog) allows such searches, e.g., author: hunt h*
May 2007
String Matching
Slide 14
Download