Dr. David Dailey, david.dailey@sru.edu
Dr. Beverly Gocal, beverly.gocal@sru.edu
Dr. Deborah Whitfield, deborah.whitfield@sru.edu

Introduction
Graph distance
String Distance
◦ Definitions
◦ Examples
◦ Implementation
◦ Theoretical Results
◦ String Space Examples

Distance
◦ may be defined for any structure
Overlap of the substructures of two structures
◦ Strings
◦ Graphs
◦ Algebraic structures
◦ Semi-groups
◦ Trees
Web site and web page similarity

Past 15 years
◦ Over 20 papers on graph similarity
◦ Several more on string similarity

Semi-Group: Let T = (S, A) together with the concatenation operation, where A consists of the set of axioms
◦ ∀x, y ∈ S, xy ∈ S
◦ ∀x, y, z ∈ S, x(yz) = (xy)z

Graph: Let T = (S, A) together with a relation ~, where A consists of the set of axioms
◦ ∀x, y ∈ S, x ~ y ⇒ y ~ x
◦ ∀x ∈ S, ¬(x ~ x)

String: Let T = (S, A) together with an associative operation (expressed by concatenation).
◦ Then let Sⁿ be defined recursively by S¹ = S and Sⁿ = S × Sⁿ⁻¹, and let S* be defined as the infinite union of ordered tuples: S¹ ∪ S² ∪ … ∪ Sⁿ ∪ …

Levenshtein distance calculates the minimum number of transformations needed to turn one string into the other
Largest shared substructure
Smallest super structure
All of these approaches are relative

Enumerate all substructures within T and U, giving T* and U*
Union those two sets: Z = T* ∪ U*
Work in a |Z|-dimensional vector space
For each z ∈ Z, let z(T) be the number of occurrences of structure z as a substructure of T
Calculate the Minkowski distance d(T, U) between the two count vectors (the examples that follow use order 1, the sum of absolute count differences)

Alphabet S = {a, b, c}, α = abaac and β = cbaac
α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac}
β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac}
Z = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac}
◦ Unique to α*: ab, aba, abaa, abaac
◦ Unique to β*: cb, cba, cbaa, cbaac

Equal frequency: I = {b, ba, aa, ac, baa, aac, baac}
Different frequency: D = {a, c} (a occurs 3 times in α and 2 times in β; c occurs once in α and twice in β)
Unique: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac}
|I| = 7, |D| = 2, and |O| = 8
|I| + |D| + |O| = |Z| = 17

Contribution of O is |O| = 8, since each unique substring occurs exactly once
Contribution of I is 0: those substrings appear equally often
Contribution of D, in this case, is |3 − 2| + |1 − 2| = 2
d(α, β) = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10

A = aabc, B = abcd
S = {a, a, aa, aab, aabc, ab, abc, b, bc, c}
T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d}
Counts for S and T
◦ S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1
◦ T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1
Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1
Distance(aabc, abcd) = 8

Too tedious by hand
http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html

Conjecture: if |α| = |β| = n and α and β share no substrings in common (i.e., |I ∪ D| = 0), then d(α, β) = n(n+1)
Lemma: if α = aⁿ, then d(α, αα) = n² + n(n+1)/2
Conjecture: if |α| = |β| = n, then d(α, αα) = d(α, αβ) = d(β, αβ) = d(β, ββ) = n² + n(n+1)/2

Pretty pics

Exhaustive substructure vector space
Calculate distance
Interesting observations
Used to study structure similarity based on size
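
The substructure-vector procedure for strings can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' JavaScript implementation at the URL above. The function names `substring_counts`, `partition`, `distance` and `concat_conjecture_holds` are hypothetical. The default Minkowski order p = 1 is an assumption taken from the worked aabc/abcd example, where the distance is the sum of absolute count differences. Occurrences are counted with overlaps, which is what the lemma for aⁿ requires. Under this counting, c falls in D for the pair abaac/cbaac (one occurrence versus two), and the distance for that pair comes out as 10.

```python
from collections import Counter
from itertools import product


def substring_counts(s: str) -> Counter:
    """Count every occurrence (overlaps included) of each non-empty substring of s."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))


def partition(t: str, u: str):
    """Split Z = T* ∪ U* into I (equal counts), D (shared, unequal counts), O (in one only)."""
    ct, cu = substring_counts(t), substring_counts(u)
    same, diff, only = set(), set(), set()
    for z in ct.keys() | cu.keys():
        if z in ct and z in cu:
            (same if ct[z] == cu[z] else diff).add(z)
        else:
            only.add(z)
    return same, diff, only


def distance(t: str, u: str, p: int = 1) -> float:
    """Minkowski distance of order p between the substring-count vectors of t and u.

    p = 1 is an assumption inferred from the worked examples.
    """
    ct, cu = substring_counts(t), substring_counts(u)
    # A Counter returns 0 for a missing key, so substrings found in only one
    # string contribute their full count.
    total = sum(abs(ct[z] - cu[z]) ** p for z in ct.keys() | cu.keys())
    return total ** (1 / p)


def concat_conjecture_holds(n: int, alphabet: str = "ab") -> bool:
    """Brute-force check of d(x, xy) = d(y, xy) = n^2 + n(n+1)/2 for all |x| = |y| = n."""
    expected = n * n + n * (n + 1) // 2
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    return all(
        distance(x, x + y) == expected and distance(y, x + y) == expected
        for x in words
        for y in words
    )


if __name__ == "__main__":
    print(distance("aabc", "abcd"))    # 8.0
    print(distance("abaac", "cbaac"))  # 10.0
    print(distance("abc", "def"))      # 12.0, i.e. n(n+1) for n = 3
    print(all(concat_conjecture_holds(n) for n in range(1, 4)))  # True
```

The brute-force check only covers short strings over a two-letter alphabet, so it is evidence for the conjectures and not a proof. Enumerating all substrings takes O(n²) entries per string, which is fine at slide-example sizes but would need a suffix-based structure for long inputs.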