concept-based mining model

國立雲林科技大學
National Yunlin University of Science and Technology
An Efficient Concept-Based Mining Model for Enhancing Text Clustering
Shady Shehata, Fakhri Karray, and Mohamed S. Kamel
TKDE, 2010
Presented by Wen-Chung Liao
2010/11/03
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objectives
THEMATIC ROLES BACKGROUND
CONCEPT-BASED MINING MODEL
Experiments
Conclusions
Comments
Motivation

Vector Space Model (VSM)
─ Represents each document as a feature vector of the terms (words or phrases) in the document.
─ Each feature vector contains term weights (usually term frequencies) of the terms in the document.
─ Term frequency captures the importance of the term within a document only.
However, two terms can have the same frequency in their documents, but one term contributes more to the meaning of its sentences than the other term.
Thus, the underlying text mining model should indicate terms that capture the semantics of text.
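As a minimal sketch of the term-frequency representation described above, the following Python snippet builds a raw term-frequency vector for a made-up document. The document, the vocabulary, and the helper name `term_frequency_vector` are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def term_frequency_vector(document, vocabulary):
    """Return the raw term-frequency vector of `document` over `vocabulary`."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical toy document and vocabulary, chosen only for illustration.
DOCUMENT = "john hits the ball and the ball breaks the window"
VOCABULARY = ["john", "hits", "ball", "breaks", "window", "the"]

VECTOR = term_frequency_vector(DOCUMENT, VOCABULARY)
for term, weight in zip(VOCABULARY, VECTOR):
    print(f"{term}: {weight}")
```

In this toy vector, `john` and `hits` receive the same weight, and the word `the` receives the highest weight even though it says little about the meaning of the sentence. Frequency alone cannot tell these cases apart.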
Objectives

A new concept-based mining model is introduced.
─ It captures the semantic structure of each term within a sentence and document, rather than only the frequency of the term within a document.
─ It effectively discriminates between nonimportant terms and terms that hold the concepts representing the sentence meaning.
─ Three measures for analyzing concepts on the sentence, document, and corpus levels are computed.
─ A new concept-based similarity measure is proposed.
  • It is based on a combination of sentence-based, document-based, and corpus-based concept analysis.
─ The similarity has a more significant effect on the clustering quality because it is insensitive to noisy terms.
THEMATIC ROLES BACKGROUND

Verb argument structure: (e.g., John hits the ball).
─ “hits” is the verb.
─ “John” and “the ball” are the arguments of the verb “hits.”
Term: either an argument or a verb.
─ either a word or a phrase
Label: a label is assigned to an argument.
─ e.g., “John” has the subject (or Agent) label; “the ball” has the object (or Theme) label.
Concept: a labeled term.
Generally, the semantic structure of a sentence can be characterized by a form of verb argument structure.
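The definitions above can be illustrated with a small data structure. This is a sketch under assumptions: the class name `Concept`, the helper names, and the use of `TARGET` as the label for the verb are illustrative choices, not something the slide prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A labeled term: a word or phrase together with its role label."""
    label: str
    term: str


# Verb argument structure of "John hits the ball".
# Labeling the verb as TARGET is an assumption made for this sketch.
JOHN_HITS_THE_BALL = [
    Concept(label="Agent", term="John"),
    Concept(label="TARGET", term="hits"),
    Concept(label="Theme", term="the ball"),
]


def arguments(structure):
    """Return the argument terms, i.e. every term that is not the verb."""
    return [concept.term for concept in structure if concept.label != "TARGET"]


def verb(structure):
    """Return the verb term of the structure."""
    return next(concept.term for concept in structure if concept.label == "TARGET")


print(verb(JOHN_HITS_THE_BALL), arguments(JOHN_HITS_THE_BALL))
```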
CONCEPT-BASED MINING MODEL

Sentence-Based Concept Analysis
─ Calculating ctf of Concept c in Sentence s
  • the conceptual term frequency, ctf
    • the number of occurrences of concept c in the verb argument structures of sentence s.
    • a concept that appears frequently in the verb argument structures of s has the principal role in contributing to the meaning of s.
    • a local measure on the sentence level
─ Calculating ctf of Concept c in Document d
  • the overall importance of concept c to the meaning of its sentences in document d.
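A minimal sketch of the two ctf levels follows. Two points are assumptions: each verb argument structure is modeled as a set of concept strings, and the document-level ctf is taken as the average of the sentence-level values over the sentences that contain the concept. The slide does not state the document-level formula, so the averaging is one plausible reading only.

```python
def ctf_sentence(concept, verb_argument_structures):
    """Count the verb argument structures of one sentence that contain `concept`."""
    return sum(1 for structure in verb_argument_structures if concept in structure)


def ctf_document(concept, sentences):
    """Average the sentence-level ctf over the sentences containing `concept`.

    Assumption: averaging is used here as one plausible document-level
    aggregation; the extracted slide text does not give the formula.
    """
    values = [ctf_sentence(concept, structures) for structures in sentences]
    values = [value for value in values if value > 0]
    return sum(values) / len(values) if values else 0.0


# Hypothetical document: a list of sentences, each a list of verb argument
# structures, each structure a set of concepts.
DOCUMENT = [
    [{"john", "hits", "ball"}, {"ball", "breaks", "window"}],
    [{"ball", "rolls"}],
]

print(ctf_sentence("ball", DOCUMENT[0]))
print(ctf_document("ball", DOCUMENT))
```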

Document-Based Concept Analysis
─ the concept-based term frequency tf
  • the number of occurrences of a concept (word or phrase) c in the original document.
  • a local measure on the document level
Corpus-Based Concept Analysis
─ the concept-based document frequency df
  • the number of documents containing concept c
  • used to reward the concepts that only appear in a small number of documents
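The two counts above can be sketched in a few lines of Python. The toy corpus and the function names are hypothetical. The `log(N / df)` factor is the usual inverse-document-frequency form and is assumed here as the way rare concepts are rewarded.

```python
import math
from collections import Counter


def concept_tf(concepts_in_document):
    """Count how often each concept occurs in one document."""
    return Counter(concepts_in_document)


def concept_df(corpus):
    """Count in how many documents each concept occurs."""
    df = Counter()
    for concepts in corpus:
        df.update(set(concepts))  # each document counts once per concept
    return df


# Hypothetical corpus: one list of extracted concepts per document.
CORPUS = [
    ["ball", "john", "ball"],
    ["ball", "window"],
    ["nanotube"],
]

TF = concept_tf(CORPUS[0])
DF = concept_df(CORPUS)
# Assumed reward for rare concepts: the standard log(N / df) factor.
REWARD = {concept: math.log(len(CORPUS) / count) for concept, count in DF.items()}

print(TF["ball"], DF["ball"], REWARD["john"] > REWARD["ball"])
```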
Example of Calculating ctf Measure
Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles.

Three verbs (created, made, and lead; colored red on the original slide) represent the semantic structure of the meaning of the sentence.
Each has its own arguments:
─ [ARG0 Texas and Australia researchers] have [TARGET created] [ARG1 industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles].
─ Texas and Australia researchers have created industry-ready sheets of [ARG1 materials] [TARGET made] [ARG2 from nanotubes that could lead to the development of artificial muscles].
─ Texas and Australia researchers have created industry-ready sheets of materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] [TARGET lead] [ARG2 to the development of artificial muscles].
A cleaning step
─ To remove stop words
─ To stem the words
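The worked example can be sketched in Python as follows. Several parts are assumptions made for illustration: the stop word list is a hypothetical minimal one, stemming is omitted, and a concept is counted once for every verb argument structure in which its cleaned word sequence appears inside any labeled term. This counting rule is one interpretation of "number of occurrences in verb argument structures" and is not confirmed by the slide.

```python
import re

# Hypothetical minimal stop word list; a real system would use a fuller one.
STOP_WORDS = {"and", "have", "of", "from", "that", "could", "to", "the"}

# The three labeled verb argument structures from the example sentence.
STRUCTURES = [
    [("ARG0", "Texas and Australia researchers"),
     ("TARGET", "created"),
     ("ARG1", "industry-ready sheets of materials made from nanotubes that "
              "could lead to the development of artificial muscles")],
    [("ARG1", "materials"),
     ("TARGET", "made"),
     ("ARG2", "from nanotubes that could lead to the development of "
              "artificial muscles")],
    [("ARG1", "nanotubes"),
     ("R-ARG1", "that"),
     ("ARGM-MOD", "could"),
     ("TARGET", "lead"),
     ("ARG2", "to the development of artificial muscles")],
]


def clean(text):
    """Lowercase the text and drop stop words (stemming omitted in this sketch)."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    return tuple(word for word in words if word not in STOP_WORDS)


def contains(sequence, sub):
    """Return True if `sub` occurs as a contiguous run inside `sequence`."""
    size = len(sub)
    return any(sequence[i:i + size] == sub for i in range(len(sequence) - size + 1))


def ctf(concept_text, structures):
    """Count the structures in which the cleaned concept appears in any term."""
    concept = clean(concept_text)
    if not concept:  # the concept consisted of stop words only
        return 0
    return sum(
        1 for structure in structures
        if any(contains(clean(text), concept) for _, text in structure)
    )


for name in ("created", "materials", "nanotubes", "lead"):
    print(name, ctf(name, STRUCTURES))
```

Under these assumptions, concepts that recur across the structures (such as `nanotubes`) get a higher ctf than concepts that appear in only one structure (such as `created`).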
A Concept-Based Similarity Measure

The concept-based similarity between two documents, d1 and d2, is calculated over the m matching concepts of d1 and d2:
[formula shown as an image on the original slide]
• The single-term similarity measure (using the TF-IDF weighting scheme) is:
[formula shown as an image on the original slide]
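Because the slide's formulas did not survive as text, the following is only a sketch of how the three measures could be combined into a similarity over matching concepts. It assumes a concept weight of (normalized tf + normalized ctf) × log(N / df) and sums the product of the two documents' weights over their matching concepts. All function names and the toy values are hypothetical, and the paper's exact definition may contain further factors.

```python
import math


def normalize(values):
    """Divide each value by the Euclidean norm of all values (zero-safe)."""
    norm = math.sqrt(sum(value * value for value in values.values()))
    return {key: (value / norm if norm else 0.0) for key, value in values.items()}


def concept_weights(tf, ctf, df, n_docs):
    """Assumed weight: (normalized tf + normalized ctf) * log(N / df)."""
    tf_weight, ctf_weight = normalize(tf), normalize(ctf)
    return {
        concept: (tf_weight[concept] + ctf_weight[concept]) * math.log(n_docs / df[concept])
        for concept in tf
    }


def concept_similarity(weights_1, weights_2):
    """Sum the weight products over the concepts both documents share."""
    matching = set(weights_1) & set(weights_2)
    return sum(weights_1[concept] * weights_2[concept] for concept in matching)


# Hypothetical corpus statistics and per-document measures.
N_DOCS = 4
DF = {"ball": 2, "john": 1, "window": 1}
W1 = concept_weights({"ball": 2, "john": 1}, {"ball": 1.5, "john": 1.0}, DF, N_DOCS)
W2 = concept_weights({"ball": 1, "window": 1}, {"ball": 1.0, "window": 1.0}, DF, N_DOCS)

print(concept_similarity(W1, W2))
```

Only `ball` matches in this toy case, so the similarity reduces to the product of its two weights. Documents without shared concepts get a similarity of zero.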
Mathematical Framework


Sensitivity analysis:
Assume that the content of document d2 is changed by Δd2.
• Assume that each concept consists of one word.
• In this case, each concept is a word and A = 1. (?)
• By approximation, the d1c value is bigger than the d1w value, and the Δd2c value is bigger than the Δd2w value.
• Hence, the sensitivity of the concept-based similarity is higher than that of the cosine similarity.
• This means that the concept-based model analyzes the similarity between two documents more deeply than the traditional approaches.
Concept-Based Analysis Algorithm
      d1   d2   d3   d4
d1
d2    L
d3    L    L
d4    L    L    L
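The lower-triangular layout above can be reproduced with a short sketch. It assumes that "L" stands for a list of concepts matched between a document and each previously processed document. This reading is an interpretation, since the algorithm listing itself is not part of the extracted text. The function name and the toy documents are hypothetical.

```python
def matched_concept_lists(documents):
    """Match each document against all previously processed documents.

    Returns a dict keyed by (newer_index, older_index), which fills only the
    lower triangle of the document-by-document matrix.
    """
    matches = {}
    for i, concepts in enumerate(documents):
        for k in range(i):  # only documents seen before document i
            matches[(i, k)] = sorted(set(concepts) & set(documents[k]))
    return matches


# Hypothetical documents d1..d4, each reduced to its list of concepts.
DOCUMENTS = [
    ["ball", "john"],
    ["ball", "window"],
    ["nanotube", "muscle"],
    ["ball", "nanotube"],
]

MATCHES = matched_concept_lists(DOCUMENTS)
for pair, concepts in MATCHES.items():
    print(pair, concepts)
```

For four documents this yields six lists, one per cell marked "L" in the table.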
EXPERIMENTAL RESULTS

Four data sets
─ 23,115 ACM abstract articles collected from the ACM digital library
  • five main categories
─ 12,902 documents from the Reuters-21578 data set
  • five category sets
─ 361 samples from the Brown corpus
  • main categories were press: reportage; press: reviews, religion, skills and hobbies, popular lore, belles-lettres, and learned; fiction: science; fiction: romance and humor.
─ 20,000 messages collected from 20 Usenet newsgroups
Evaluation methods
Three standard document clustering techniques:
─ Hierarchical Agglomerative Clustering (HAC)
─ Single-Pass Clustering
─ k-Nearest Neighbor (k-NN)
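To make one of the listed techniques concrete, here is a minimal sketch of single-pass clustering. It is a generic textbook-style version, not the experimental setup of the paper. The threshold value, the choice of the first cluster member as the cluster representative, the cosine similarity, and the toy vectors are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[key] * b[key] for key in set(a) & set(b))
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_clustering(vectors, similarity, threshold):
    """Assign each document, in order, to its most similar existing cluster.

    A document starts a new cluster when no existing cluster reaches
    `threshold`. Assumption: a cluster is represented by its first member.
    """
    clusters = []  # each cluster is a list of document indices
    for index, vector in enumerate(vectors):
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            score = similarity(vector, vectors[cluster[0]])
            if score >= best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([index])
        else:
            best_cluster.append(index)
    return clusters


# Hypothetical weighted document vectors.
VECTORS = [
    {"ball": 1, "john": 1},
    {"ball": 1, "window": 1},
    {"nanotube": 1, "muscle": 1},
    {"nanotube": 1},
]

CLUSTERS = single_pass_clustering(VECTORS, cosine, threshold=0.4)
print(CLUSTERS)
```

Any document similarity can be passed in place of `cosine`, which is how a concept-based measure would plug into the same procedure.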
Four different concept-based weighting schemes:
Conclusions



Bridges the gap between natural language processing and text mining disciplines. (?)
By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
There are a number of possibilities for extending this paper:
─ Link this work to Web document clustering.
─ Apply the same model to text classification.
Comments

Advantages
─ Better similarity considering the semantic structure of sentences in documents.
Shortcomings
─ Ambiguous algorithm
Applications
─ Text clustering
─ Text classification