Query and Analysis on the document and customer/item bag card of the DataDex Kellie Erickson Outline • • • • • Proposed idea Background information General steps toward my idea Objectives to achieve for next month Q/A, comments, and suggestions Proposed Idea: Identifying the number of times an author cutand-pastes in a document Item ∞ 6 5 cust itembag card 4 3 itembag itembag card 2 People Author Customer termdoc card 1 4 gene gene card (ppi) 3 2 1 PI 1 2 term G 1 2 3 4 5 6 7 2 3 4 5 expPI card 1 Gene 1 1 docdoc 1 card 1 1 1 People 1 3 3 Doc 1 1 expgene card 3 4 5 gene gene card (ppi) 6 t Gene 1 2 termterm card (share stem?) 3 4 5 6 7 1 2 3 4 1authordoc 1 1 1 1 card 1 1 1 5 6 7 ItemBag 1 1 2 3 4 ItemBag 5 6 ∞ Possible applications: • • • • Detect plagiarism Quality of an author’s paper Sentence query techniques Automatic Key Sentence Generator Steps towards proposed idea: • information retrieval • identify if authors cut-and-paste using MBR Background Information • • • • keyword generator (KWL) summarization generators (KSL) anti-plagiarism tools (sentence similarity) sentence query (sentence similarity) Keyword/Key Phrase Generator • What are keywords? – Words that relate to a particular topic • Identify keywords: – position of the word in the document – tf-idf score [4]: • term frequency in a given document gives a measure of the importance of a term within a particular document • inverse document frequency is a measure of the general importance of the term • Automatic keyword extractor – Kea (Baye’s Theorem) [8] – GenEx (Genetic Algorithm) [7] Key Sentence Generator • Identify Key Sentences: – – – – – – • Sentence position Sentence length tf-idf: sentences containing more keywords are more likely to be relevant Similarity to the title: Greater number of words in a sentence that match the title, the more important the sentence [2] Complete sentences [3] Indicators (In conclusion…, We found…) Key Sentence Extractor – Use scoring function [2] Similarity Between Key Sentences • Should identify semantically similar sentences and sentences with equal or similar scores • Methods: 1) Dice coefficient: based on the number of words between two sentences – 3 types of weights for each word: • 1 if the word appears in a sentence, otherwise 0 • tf of a word • tf-idf of the word [2] Similarity Between Key Sentences • Methods (cont): 2) Number of keywords between sentences [1] 3) if the intersection of their keyword sets are the same size or slightly smaller [5] Assumption: • Focusing on the database aspect, not on the linguistic point of view Steps towards proposed idea: • information retrieval – Generate keyword/key phrase list (KWL) – Generate key sentence list (KSL) • identify if authors cut-and-paste – Similarity between sentences – MBR Generate Keyword/Key Phrase List • Use an approach already available (Kea) • Need to consider: – Stop words (ex. the, to, and, a, is, in, with, be) – Stem words (ex. agree, agreed, agreeable) [9] Generate Key Sentence List • Use KWL and KPL to help identify key sentences – Frequency of KWL and KPL found in a sentence • Identify heuristics to determine key sentences – Introduction – First and last sentence in a paragraph – Conclusion [6] • Identify grammar rules to determine key sentences Identify if authors cut and paste • Implement MBR – Paragraphs considered transactions – Key sentences considered items in an item set – Find frequent item sets to determine the amount of cut-and-paste Identify if authors cut and paste • Implement MBR – Key sentence list {ks1, ks2, ks3} – Paragraphs in the document {p1, p2, p3 , p4} p1 p2 p3 p4 ks1 1 0 1 0 ks2 1 0 0 0 ks3 1 0 1 1 Identify if authors cut and paste • Identify sentence-to-sentence similarity Objectives to achieve next month • Find an adequate automatic KWL/KPL extractor and run a training set. • Identify heuristics and rules to create KSL • Identify rules for similarity between sentences References [1] E. Park, S. Moon, and D. Ra. Web Document Retrieval Using Sentence-query Similarity. http://citeseer.ist.psu.edu/park02web.html. [2] C. Nobato, S. Sekine, K. Uchimoto, and H. Isahara. A Summarization System with Categorization of Document Sets. http://www.cs.nyu.edu/~sekine/papers/tsc2nova2.ps. [3] E. Alfonseca, J. Guirao, and A. Moreno-Sandoval. Description of UAM System for Generating Very Short Summaries at DUC-2004. http://www.nlpir.nist.gov/projects/duc/pubs/2004papers/uautonoma2.alfonseca.pdf. [4] Y. Uzun. Keyword Extraction Using Naïve Bayes. http://www.cs.bilkent.edu.tr/~guvenir/courses/cs550/Workshop/Yasin_Uzun.pdf. [5] C. Collberg, S. Kobourov, J. Louie, and T. Slattery. SPLAT: A System for Self-Plagiarism Detection. http://splat.cs.arizona.edu/icwi_plag.pdf [6] A Fox. Armando’s Paper Writing and Presentations Page. http://swig.standford.edu/~fox/paper_writing.html. [7] P.D. Turney. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 1999. [8] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, and C.G. Nevill-Manning. Domain-specific keyphrase extraction. In IJCAI, pages 668-673, 1999. [9] J.Callan. Text Data Mining. http://hartford.lti.cs.cmu.edu/classes/95-779.