KSL

advertisement
Query and Analysis on the
document and customer/item bag
card of the DataDex
Kellie Erickson
Outline
•
•
•
•
•
Proposed idea
Background information
General steps toward my idea
Objectives to achieve for next month
Q/A, comments, and suggestions
Proposed Idea:
Identifying the number of times an author cutand-pastes in a document
Item
∞
6
5
cust
itembag
card
4
3
itembag
itembag
card
2
People Author
 Customer
termdoc card
1
4
gene
gene
card
(ppi)
3
2
1
PI
1
2
term  G 1 2 3 4 5 6 7
2 3 4 5
expPI
card
1
Gene
1
1
docdoc
1
card
1
1
1
People 
1
3
3
Doc
1
1
expgene
card
3
4
5
gene
gene
card
(ppi)
6
t
Gene
1
2
termterm card
(share stem?)
3
4
5
6
7
1 2 3 4
1authordoc
1 1 1
1 card
1 1 1
5 6 7
ItemBag
1
1 2 3 4
ItemBag
5 6
∞
Possible applications:
•
•
•
•
Detect plagiarism
Quality of an author’s paper
Sentence query techniques
Automatic Key Sentence Generator
Steps towards proposed idea:
• information retrieval
• identify if authors cut-and-paste using
MBR
Background Information
•
•
•
•
keyword generator (KWL)
summarization generators (KSL)
anti-plagiarism tools (sentence similarity)
sentence query (sentence similarity)
Keyword/Key Phrase Generator
• What are keywords?
– Words that relate to a particular topic
• Identify keywords:
– position of the word in the document
– tf-idf score [4]:
• term frequency in a given document gives a measure of the importance
of a term within a particular document
• inverse document frequency is a measure of the general importance of
the term
• Automatic keyword extractor
– Kea (Baye’s Theorem) [8]
– GenEx (Genetic Algorithm) [7]
Key Sentence Generator
•
Identify Key Sentences:
–
–
–
–
–
–
•
Sentence position
Sentence length
tf-idf: sentences containing more keywords are more likely to be
relevant
Similarity to the title: Greater number of words in a sentence that
match the title, the more important the sentence [2]
Complete sentences [3]
Indicators (In conclusion…, We found…)
Key Sentence Extractor
–
Use scoring function [2]
Similarity Between Key Sentences
• Should identify semantically similar sentences
and sentences with equal or similar scores
• Methods:
1) Dice coefficient: based on the number of words
between two sentences
– 3 types of weights for each word:
• 1 if the word appears in a sentence, otherwise 0
• tf of a word
• tf-idf of the word [2]
Similarity Between Key Sentences
• Methods (cont):
2) Number of keywords between sentences [1]
3) if the intersection of their keyword sets are the
same size or slightly smaller [5]
Assumption:
• Focusing on the database aspect, not on the
linguistic point of view
Steps towards proposed idea:
• information retrieval
– Generate keyword/key phrase list (KWL)
– Generate key sentence list (KSL)
• identify if authors cut-and-paste
– Similarity between sentences
– MBR
Generate Keyword/Key Phrase List
• Use an approach already available (Kea)
• Need to consider:
– Stop words (ex. the, to, and, a, is, in, with, be)
– Stem words (ex. agree, agreed, agreeable) [9]
Generate Key Sentence List
• Use KWL and KPL to help identify key sentences
– Frequency of KWL and KPL found in a sentence
• Identify heuristics to determine key sentences
– Introduction
– First and last sentence in a paragraph
– Conclusion [6]
• Identify grammar rules to determine key sentences
Identify if authors cut and paste
• Implement MBR
– Paragraphs considered transactions
– Key sentences considered items in an item set
– Find frequent item sets to determine the amount
of cut-and-paste
Identify if authors cut and paste
• Implement MBR
– Key sentence list {ks1, ks2, ks3}
– Paragraphs in the document {p1, p2, p3 , p4}
p1
p2
p3
p4
ks1
1
0
1
0
ks2
1
0
0
0
ks3
1
0
1
1
Identify if authors cut and paste
• Identify sentence-to-sentence similarity
Objectives to achieve next month
• Find an adequate automatic KWL/KPL
extractor and run a training set.
• Identify heuristics and rules to create KSL
• Identify rules for similarity between
sentences
References
[1] E. Park, S. Moon, and D. Ra. Web Document Retrieval Using Sentence-query Similarity.
http://citeseer.ist.psu.edu/park02web.html.
[2] C. Nobato, S. Sekine, K. Uchimoto, and H. Isahara. A Summarization System with Categorization
of Document Sets. http://www.cs.nyu.edu/~sekine/papers/tsc2nova2.ps.
[3] E. Alfonseca, J. Guirao, and A. Moreno-Sandoval. Description of UAM System for Generating
Very Short Summaries at DUC-2004.
http://www.nlpir.nist.gov/projects/duc/pubs/2004papers/uautonoma2.alfonseca.pdf.
[4] Y. Uzun. Keyword Extraction Using Naïve Bayes.
http://www.cs.bilkent.edu.tr/~guvenir/courses/cs550/Workshop/Yasin_Uzun.pdf.
[5] C. Collberg, S. Kobourov, J. Louie, and T. Slattery. SPLAT: A System for Self-Plagiarism
Detection. http://splat.cs.arizona.edu/icwi_plag.pdf
[6] A Fox. Armando’s Paper Writing and Presentations Page.
http://swig.standford.edu/~fox/paper_writing.html.
[7] P.D. Turney. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 1999.
[8] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, and C.G. Nevill-Manning. Domain-specific
keyphrase extraction. In IJCAI, pages 668-673, 1999.
[9] J.Callan. Text Data Mining. http://hartford.lti.cs.cmu.edu/classes/95-779.
Download