Dr. David Dailey
david.dailey@sru.edu
Dr. Beverly Gocal
beverly.gocal@sru.edu
Dr. Deborah Whitfield
deborah.whitfield@sru.edu



Introduction
Graph distance
String Distance
◦ Definitions
◦ Examples
◦ Implementation
◦ Theoretical Results
◦ String Space Examples

Distance
◦ may be defined for any structure

Overlap of the substructures of two structures
◦ Strings
◦ Graphs
◦ Algebraic structures
◦ Semi-groups
◦ Trees
Web site and web page similarity

Past 15 years
◦ Over 20 papers on graph similarity
◦ Several more on string similarity


Semi-group: Let T=(S, A) together with the concatenation
operation, where A consists of the set of
axioms
◦ ∀ x, y ∈ S, xy ∈ S
◦ ∀ x, y, z ∈ S, x(yz) = (xy)z

Graph: Let T=(S, A) together with a relation ~
where A consists of the set of axioms
◦ ∀ x, y ∈ S, x ~ y ⇒ y ~ x
◦ ∀ x ∈ S, ¬(x ~ x)

String: Let T=(S, A) together with an
associative operation (expressed by
concatenation).
◦ Then let S^n be defined recursively by
▪ S^1 = S and
▪ S^n = S × S^(n-1), and
▪ let S* be defined as the infinite union of ordered tuples:
S^1 ∪ S^2 ∪ … ∪ S^n ∪ …
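As a small illustration of the recursive construction, the Python sketch below enumerates S^1 through S^n for a finite alphabet and collects them, which gives a finite truncation of S*. The function name strings_up_to and the cutoff max_len are assumptions made for this example and do not come from the slides.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Return the elements of S^1, S^2, ..., S^max_len, written as strings."""
    result = []
    for n in range(1, max_len + 1):
        # product(alphabet, repeat=n) yields the ordered n-tuples of S^n
        result.extend("".join(t) for t in product(alphabet, repeat=n))
    return result

print(strings_up_to("ab", 2))
```

The full S* is infinite, so any program can only walk a bounded prefix of it.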




Levenshtein distance calculates the minimum
number of transformations (single-character edits)
Largest shared substructure
Smallest super structure
All of these approaches are relative
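For comparison with the substructure approach, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence. It assumes the usual three unit-cost transformations (insert, delete, substitute); the slides do not say which variant is meant.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```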





Enumerate all substructures within T and U
Union those two sets: Z = T* ∪ U*
Work in a |Z|-dimensional vector space
Let z(T) be the number of occurrences of
structure z as a substructure of T
Calculate the Minkowski distance d(T,U)
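The steps above can be sketched for strings as follows. This is a minimal sketch under two assumptions: substructures are taken to be non-empty contiguous substrings, and the default order p = 1 is chosen because it reproduces the value 8 that the slides report for aabc and abcd. The function names are hypothetical.

```python
from collections import Counter

def substring_counts(s):
    """z(T): count every occurrence of each non-empty contiguous substring of s."""
    n = len(s)
    return Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1))

def substructure_distance(t, u, p=1):
    """Minkowski distance of order p between the substring-count vectors of t and u."""
    ct, cu = substring_counts(t), substring_counts(u)
    z = set(ct) | set(cu)  # Z = T* ∪ U*; a Counter returns 0 for absent substrings
    return sum(abs(ct[k] - cu[k]) ** p for k in z) ** (1 / p)

print(substructure_distance("aabc", "abcd"))
```

A string of length n has n(n+1)/2 substring occurrences, so this brute-force enumeration is only practical for short strings.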








Alphabet S = {a,b,c}, α = abaac and β = cbaac
α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac,
abaac}
β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa,
baac, cbaac}
Z = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac,
cbaa, abaa, baac, cbaac, abaac} (cb, cba, cbaa, cbaac
are unique to β*; ab, aba, abaa, abaac are unique
to α*)
Equal frequency: I = {b, ba, aa, ac, baa, aac, baac}
Different frequency: D = {a, c} (a occurs 3 times in α
and 2 times in β; c occurs once in α and twice in β)
Unique: O = {ab, cb, cba, aba, cbaa, abaa, cbaac,
abaac}
|I| = 7, |D| = 2, and |O| = 8



|I| + |D| + |O| = |Z| = 17
Contribution of O is |O| = 8 (each unique substring
occurs once)
Contribution of I is 0: substrings appear
equally often
Contribution of D, in this case, is 1 + 1 = 2
d(α,β) = contribution(I) + contribution(D) +
contribution(O) = 0 + 2 + 8 = 10
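The three-way split can also be recomputed mechanically, which is a useful check because single-character counts are easy to mistally by hand. The sketch below assumes substructures are contiguous substrings and that each substring contributes the absolute difference of its two counts; the helper names are hypothetical.

```python
from collections import Counter

def substring_counts(s):
    """Count every occurrence of each non-empty contiguous substring of s."""
    n = len(s)
    return Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1))

def partition(t, u):
    """Split Z into I (equal counts), D (shared, unequal counts), O (in one string only)."""
    ct, cu = substring_counts(t), substring_counts(u)
    shared = set(ct) & set(cu)
    equal = {z for z in shared if ct[z] == cu[z]}
    different = shared - equal
    only_one = set(ct) ^ set(cu)
    return equal, different, only_one

def distance_from_partition(t, u):
    """Sum the contributions of D and O; I contributes 0 by construction."""
    ct, cu = substring_counts(t), substring_counts(u)
    _, different, only_one = partition(t, u)
    return sum(abs(ct[z] - cu[z]) for z in different | only_one)

equal, different, only_one = partition("abaac", "cbaac")
print(sorted(equal), sorted(different), sorted(only_one))
print(distance_from_partition("abaac", "cbaac"))
```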




A= aabc B= abcd
S= {a, a, aa, aab, aabc, ab, abc, b, bc, c}
T= {a, ab, abc, abcd, b, bc, bcd, c, cd, d}
Counts for S and T
◦ S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1
◦ T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1


Differences: a:1 aa:1 aab:1 aabc:1 ab:0
abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1
Distance (aabc, abcd) = 8



Too tedious by hand
http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html
Distance (aabc, abcd) = 8




Conjecture: if |α|=|β|=n and α and β share no
substrings in common (i.e., |I ∪ D|=0), then
d(α,β) = n(n+1)
Lemma: if α = a^n then d(α,αα) = n^2 + n(n+1)/2
Conjecture: if |α|=|β|=n, then
d(α,αα) = d(α,αβ) = d(β,αβ) = d(β,ββ) =
n^2 + n(n+1)/2
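The closed forms above can be spot-checked numerically for small n. This is a sanity check under the assumption that d is the sum of absolute differences of substring counts, not a proof; the test strings a^n and b^n and the function names are choices made for this example.

```python
from collections import Counter

def substring_counts(s):
    """Count every occurrence of each non-empty contiguous substring of s."""
    n = len(s)
    return Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1))

def l1_distance(t, u):
    """Sum of absolute differences between the substring counts of t and u."""
    ct, cu = substring_counts(t), substring_counts(u)
    return sum(abs(ct[z] - cu[z]) for z in set(ct) | set(cu))

def closed_form(n):
    """n^2 + n(n+1)/2, the value stated in the lemma."""
    return n * n + n * (n + 1) // 2

for n in range(1, 7):
    alpha, beta = "a" * n, "b" * n  # two strings with no substring in common
    assert l1_distance(alpha, beta) == n * (n + 1)
    assert l1_distance(alpha, alpha + alpha) == closed_form(n)
    assert l1_distance(alpha, alpha + beta) == closed_form(n)
    assert l1_distance(beta, alpha + beta) == closed_form(n)
    assert l1_distance(beta, beta + beta) == closed_form(n)
print("all checks passed for n = 1..6")
```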

Pretty pics



Exhaustive substructure vector space
Calculate distance
Interesting observations used to study
structure similarity based on size