Document Similarities
Anand Bahety
Cody Dunne
Project Idea
• Find similar segments of documents
Project Idea (cont…)
• Inexact matching
– Local alignment (Smith-Waterman, BLAST)
– Based on character
• Meaningless to score character differences
– Based on word
• Need a good scoring function
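A minimal sketch of word-based local alignment in the style of Smith-Waterman follows. The names `tokenize`, `word_score`, and `local_align` are hypothetical, and the score values (+2 match, -1 mismatch, -1 gap) are placeholder assumptions, not the project's actual scoring function.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_score(a, b):
    """Placeholder scoring (assumption): reward exact word matches only."""
    return 2 if a == b else -1


def local_align(words_a, words_b, score=word_score, gap=-1):
    """Smith-Waterman local alignment over two word lists.

    Returns (best_score, pairs), where each pair is (word_a, word_b)
    and None marks a gap on that side.
    """
    rows, cols = len(words_a) + 1, len(words_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the
    # alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(words_a[i - 1], words_b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(words_a[i - 1], words_b[j - 1]):
            pairs.append((words_a[i - 1], words_b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            pairs.append((words_a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, words_b[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    a = tokenize("The quick brown fox jumps")
    b = tokenize("A quick brown dog jumps")
    print(local_align(a, b))
```

Passing a different `score` callable is where a scoring function based on word relationships would plug in.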
Project Idea (cont…)
• Scoring function based on word relationships
– Part of speech
• Noun -> pronoun (ok)
• Noun -> verb (worse)
– Synonyms – positive score
– Antonyms – negative score
– Network of word relationships
• WordNet – publicly available lexical database of English
– Gaps
• Different numbers of adjectives/adverbs
• Prepositions, pronouns
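One possible shape for such a scoring function is sketched below. To stay self-contained it uses a tiny hand-written lexicon (`POS`, `SYNONYMS`, `ANTONYMS`) as a stand-in for WordNet lookups. Every word entry, weight, and function name here is a hypothetical assumption chosen only to respect the ordering in the bullets above.

```python
# Toy lexicon standing in for WordNet (assumption, for illustration only).
POS = {
    "dog": "noun", "hound": "noun", "it": "pronoun", "run": "verb",
    "big": "adj", "large": "adj", "small": "adj",
    "quickly": "adv", "on": "prep",
}
SYNONYMS = {frozenset(("big", "large")), frozenset(("dog", "hound"))}
ANTONYMS = {frozenset(("big", "small"))}

# Assumed weights: only their relative order reflects the slides.
MATCH, SYNONYM, ANTONYM, DEFAULT_MISMATCH = 3, 2, -3, -2
POS_SUBSTITUTION = {
    frozenset(("noun", "pronoun")): 0,   # noun -> pronoun is acceptable
    frozenset(("noun", "verb")): -3,     # noun -> verb is worse
}
# Word classes that often differ in number between similar sentences.
CHEAP_GAP_POS = {"adj", "adv", "prep", "pronoun"}


def word_score(a, b):
    """Score substituting word a for word b."""
    if a == b:
        return MATCH
    pair = frozenset((a, b))
    if pair in SYNONYMS:
        return SYNONYM
    if pair in ANTONYMS:
        return ANTONYM
    pos_pair = frozenset((POS.get(a), POS.get(b)))
    return POS_SUBSTITUTION.get(pos_pair, DEFAULT_MISMATCH)


def gap_score(word):
    """Score skipping a word: cheaper for modifiers and function words."""
    return -1 if POS.get(word) in CHEAP_GAP_POS else -3


if __name__ == "__main__":
    for a, b in [("big", "large"), ("big", "small"), ("dog", "it"), ("dog", "run")]:
        print(a, b, word_score(a, b))
    print("gap quickly:", gap_score("quickly"), "gap dog:", gap_score("dog"))
```

In a real system the dictionary lookups would be replaced by queries against WordNet's synonym, antonym, and part-of-speech data.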
Related Work
• Document versioning (Versioning Machine, etc…)
• Detecting plagiarism (Bagdis, etc…)
Potential Pitfalls
• False positives
• The Great Wall of China is very famous.
• The Fantastic Wall by XYZ is very famous.
– Pick correct word meanings
• False negatives
– Database isn’t perfect/complete
• Incomplete scoring function
– Only examines particular types of words
– Depends on order
• Limited to English
– EuroWordNet (wordnets for other European languages)
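The false-positive example above can be made concrete with a quick word-overlap check. This sketch uses Python's standard `difflib.SequenceMatcher` as a simple stand-in for the alignment score; the helper names `tokenize` and `word_overlap` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_overlap(a, b):
    """In-order word overlap: 2 * matched words / total words in both."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


if __name__ == "__main__":
    s1 = "The Great Wall of China is very famous."
    s2 = "The Fantastic Wall by XYZ is very famous."
    print(word_overlap(s1, s2))
```

Five of the eight words in each sentence line up in order, so the overlap is high even though the sentences describe different things. A synonym-aware score could rate the pair even higher, since "great" and "fantastic" are close in meaning.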