tries

advertisement
Tries
 Trees of order >= 2
 Variable length keys
 The decision on what path to follow is
taken based on potion of the key
 Static environment, fast retrieval but
large space overhead
 Applications:
 dictionaries
 text searching (patricia tries)
 compression (Ziv-Lembel encoding)
E.G.M. Petrakis
Tries
1
Example
 MACCBOY
MACAW
MACARON
MACARONI
MACARONIC
MACAQUE
MACADAMIA
MACADAM
MACACACO
MACABRE
.
.
.
E.G.M. Petrakis
Tries
2
1. Words in Tree
M
A
C
A
C
A
B
C
D
Q
R
W
B
R
A
A
U
O
@
O
E
@
C
O @
M
E
N
O
Y
I
@
I
N
@
@
A
@
@
C @
@
@: word terminating symbol
as many words as terminating symbols
E.G.M. Petrakis
Tries
3
Word Searching
 At most m+1 comparisons to find the word x@=a1a2….am+1,
with root: i=1 and t(ai): child of node ai.
i=1; u = ai;
while (u != @ && i <= m) {
if (t(ai): undefined)
X is not in the trie;
else {
u = t(ai); i = i+1;
}
if (u == @ && word(t) == X) X is in the trie;
else X is not in the trie;
}
E.G.M. Petrakis
Tries
4
2. Words at the Leafs
M
A
C
A
B
.
C
C
D
.
Q
.
A
MACABRE MACACO
.
R
MACAQUE
M
N
I
. .
MACADAM
MACADAMI
I
.
W MACCBOY
.
O MACAW
O
MACAROON
MACARONI
C
.
MACARONIC
E.G.M. Petrakis
Tries
5
Example Implementation
26 symbols/node => large space overhead
E.G.M. Petrakis
Tries
6
Insertions
“bluejay” inserted
E.G.M. Petrakis
Tries
7
3. Compact Trie
compact
nodes
MAC
A
C
...

MACADAM
DAM
Q
I

MACAQUE

MACADAMI
RO
N

MACAW

MACABOY

MACAROO

MACARON

MACARONI
I
C
words
at the
leafs

MACARONIC
E.G.M. Petrakis
Tries
8
Patricia Trie
 Practical Algorithm To Retrieve
Information Coded In Alphanumeric
 especially good for text searching
 replace sub-words by a reference position in
text together with its length
 e.g., “jay” in “the blue jay” is (10,3)
 internal nodes keep only the length and the
first letter of the phrase
 external nodes keep the position of the phrase
in the text
E.G.M. Petrakis
Tries
9
Suffix Trie
 Construct a trie with




all suffixes
“the blue-jay@” => 12
suffixes
Merge branches with
common prefix
Leafs point to suffices
Search: “e-jay”
 go to position” 3 or 7
 increase by query
length: 5
 check suffices at
positions 3+5 and 7+5
 the second one matches
E.G.M. Petrakis
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
Tries
theblue-jay@
heblue-jay@
eblue-jay@
blue-jay@
lue-jay@
ue-jay@
e-jay@
-jay@
jay@
ay@
y@
@
10
Suffix Trie
“the
1
t
2
4
b
h e
3
b
E.G.M. Petrakis
7
-
5
l
6
u
8 9
-
j
10
a
blue-jay”
11
y
12
@
Search: “e-jay@”
Tries
11
Search Patricia
 Scan query from left to right
 go to first matching symbol in trie
 increase by the increment
 if not in the trie => search failed
 if all symbols match the query => found !!!
E.G.M. Petrakis
Tries
12
Patricia Pros & Cons
 Good for secondary storage
applications
 the trie is not high
 The space overhead is still a problem
 Mainly useful for static environments
E.G.M. Petrakis
Tries
13
Ziv-Lempel Encoding
 Mainly compression method
 for any kind of data
 based on tries and symbol alphabets
 Basic idea: compute code for repeating
symbols or for sets of symbols
 Globally and locally optimal
 Compared with Huffman
 no a-priori knowledge of symbol probabilities
 Huffman is globally optimum but, not always
 Locally optimal (for parts of the input)
E.G.M. Petrakis
Tries
14
Encoding
 Construct trie:




nodes: code of a symbol or of sequences of symbols
arc: alphabet symbol
initially a trie with all symbols
each symbol is assigned a unique code
 Scan input from left to right:
 read next symbol
 follow appropriate branch in trie
 if symbol not in trie => insert symbol in trie, assign it
a new code and output its code
E.G.M. Petrakis
Tries
15
initial trie
root
a
b
c
z
........
1
2
3
26
codes
E.G.M. Petrakis
Tries
16
alphabet {a, b, c}
initial trie:
α
c
b
1
2
3
Input:
a
1
b
2
E.G.M. Petrakis
a
3
b
4
c
5
b
6
Tries
a
7
b
8
a
9
b
10
a
11
a
12
17
α
α
c

b
1
2
α
3
1
2
c
α
α
1
3
b
α
4
Tries

2
b
4
E.G.M. Petrakis
c
b

1
2
3
output 2
b
b

b
output 1
α
b
c
3
5
18
output 4
α
c
α
c
b
1
2
α
b
3

c
b
b
1

2
α
b
3
4
4
5
5
c
output 3
6
c
b
2

3
α
E.G.M. Petrakis
b
a
b

α
b
5
7
Tries
5
19
output 5
b
a
b
2
α

b

2
α
5
5
b
b
8
output 8
8
α
output 1
1
c
b
b
2
α
5
a

2
α
5
b
b

4
1
c
8
b
6
α
8
E.G.M. Petrakis
α
a
9
Tries
20
Decoding
 Follows the same sequence of steps in
reverse order
 start from the same initial trie
 scan the code from left to right
 read next code symbol
 go to the node having the same code
 if the code is not in the trie guess the
appropriate position in the trie
E.G.M. Petrakis
Tries
21
Alphabet: {a, b, c}
Code: 1 2 4 3 5 8 1
a
1
E.G.M. Petrakis
c
b
2
Tries
3
22
read(1): must be coming from “a”
output “α”
α
1
c

b
1
2
c
b
2
3
1
output “b”
4

b
4 Petrakis
E.G.M.
2
c

b
3
α
1
α
Tries
2
3
In order to go
from “1” to “2”,
we must have
visited “b”
23
In order to go from “2” to “4” we must go
through “ab” (passing from 1)
“a” was a child of “2”
output “αb”
α
c
3
b
1
2
E.G.M. Petrakis

α
b
4
3
5
Tries
24
In order to go from “4” to “3” we must
go through “c” and“c” was child of “4”
α
c
output “c”
b
1
2
5

α
b
4
3
5
c
6
E.G.M. Petrakis
Tries
25
In order to go from “3” to “5” we must go
through “ba” and “b” was child of “3”
The next symbol is “8”, but … there is no
“8” in the trie!!
α
c
b
1
2
4
3
α
b
5
8
b
7
c
6
E.G.M. Petrakis
Tries
26
This may happen only if after “5” we
passed through the same path since “8”
cannot be under “6” or “7” (there is no
“6” or “7” in the code)
α
c
output “bαb”
b
1
2
α
b
4
3
5
c
6
E.G.M. Petrakis
1
b

7
…
b
8
Tries
27
α
c
b
1
2
α
b
4
b
5
c
6
3
output “α”
7
b
8
α
9
E.G.M. Petrakis
Tries
28
Generalize
.....
d
.....
n
.....
d
.....

.....
.....
x
x
d
n
output “dxd”
Special case: “x” can be null
E.G.M. Petrakis
Tries
29
Download