Document Similarities
Anand Bahety
Cody Dunne

Project Idea
• Find similar segments of documents

Project Idea (cont…)
• Inexact matching
  – Local alignment (Smith-Waterman, BLAST)
  – Based on character
    • Meaningless to score character differences
  – Based on word
    • Need a good scoring function

Project Idea (cont…)
• Scoring function based on word relationships
  – Part of speech
    • Noun -> pronoun (ok)
    • Noun -> verb (worse)
  – Synonyms – positive score
  – Antonyms – negative score
  – Network of word relationships
    • WordNet – publicly available lexical English database
  – Gaps
    • Different numbers of adjectives/adverbs
    • Prepositions, pronouns

Related Work
• Document versioning (Versioning Machine, etc…)
• Detecting plagiarism (Bagdis, etc…)

Potential Pitfalls
• False positives
    • The Great Wall of China is very famous.
    • The Fantastic Wall by XYZ is very famous.
  – Pick correct word meanings
• False negatives
  – Database isn’t perfect/complete
• Incomplete scoring function
  – Only examines particular types of words
  – Depends on order
• Limited to English
  – EuroWordNet
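
The word-level local alignment idea in these slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `SYNONYMS` and `ANTONYMS` tables are tiny hand-made stand-ins for WordNet lookups, and the score values (identical = 2, synonym = 1, unrelated = -1, antonym = -2) and the gap penalty of -1 are assumptions chosen only for the example. Part-of-speech scoring and word-sense selection are left out.

```python
# Hypothetical relationship tables standing in for WordNet lookups.
SYNONYMS = {frozenset(("great", "fantastic")), frozenset(("famous", "renowned"))}
ANTONYMS = {frozenset(("famous", "obscure"))}

GAP = -1  # assumed penalty for skipping a word in either document


def word_score(a, b):
    """Score a word pair: identical > synonym > unrelated > antonym (assumed values)."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 2
    pair = frozenset((a, b))
    if pair in SYNONYMS:
        return 1
    if pair in ANTONYMS:
        return -2
    return -1


def local_align(doc_a, doc_b):
    """Smith-Waterman over words. Returns (best score, list of aligned word pairs)."""
    x, y = doc_a.split(), doc_b.split()
    # h[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    h = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            h[i][j] = max(
                0,  # start a fresh local alignment
                h[i - 1][j - 1] + word_score(x[i - 1], y[j - 1]),
                h[i - 1][j] + GAP,  # word of x aligned to a gap
                h[i][j - 1] + GAP,  # word of y aligned to a gap
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + word_score(x[i - 1], y[j - 1]):
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + GAP:
            pairs.append((x[i - 1], None))
            i -= 1
        else:
            pairs.append((None, y[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    score, pairs = local_align(
        "The Great Wall of China is very famous.",
        "The Fantastic Wall by XYZ is very famous.",
    )
    print(score)
    for a, b in pairs:
        print(a, "<->", b)
```

Run on the two example sentences from the Potential Pitfalls slide, this sketch aligns all eight words and gives the pair a high score even though the sentences describe different things. That is the false-positive case the slide warns about.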