Multiview Background Knowledge for Ontology Construction

Blaž Fortuna, Marko Grobelnik, Dunja Mladenić
Jožef Stefan Institute, Slovenia
Bag-of-words
Example document:
"Computers are used in increasingly diverse ways in Mathematics and the Physical and Life Sciences. This workshop aims to bring together researchers in Mathematics, Computer Science, and the Sciences to explore the links between their disciplines and to encourage new collaborations."
• There exist various ways of selecting word weights. In our paper we propose a method to learn them!
Word Weights
• Documents are encoded as vectors.
• Each element of the vector corresponds to the frequency of one word.
• Each word can also be weighted according to its importance, as in the example below.
word          weight   TF   TF × weight
computer      0.9      2    1.8     (important)
mathematics   0.8      2    1.6     (important)
are           0.01     1    0.01    (noise)
and           0.01     4    0.04    (noise)
science       0.9      3    2.7     (important)
…             …        …    …
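A small Python sketch of this encoding, reproducing the table above; the vocabulary, counts and weights are the illustrative example values, not learned ones.

    import numpy as np

    # Example values from the table above; in practice the weights are learned.
    vocab   = ["computer", "mathematics", "are", "and", "science"]
    weights = np.array([0.9, 0.8, 0.01, 0.01, 0.9])  # per-word importance
    tf      = np.array([2.0, 2.0, 1.0, 4.0, 3.0])    # raw term frequencies

    weighted = tf * weights  # element-wise product damps the noise words
    for word, t, wt in zip(vocab, tf, weighted):
        print(f"{word:12s} TF={t:.0f}  weighted TF={wt:.2f}")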
SVM feature selection
Input:
• Set of documents
• Set of categories
• Each document is assigned a subset of categories
Output:
• Ranking of words according to importance
Intuition:
• A word is important if it discriminates documents according to categories.
Basic algorithm:
• Learn a linear SVM classifier for each of the categories.
• A word is important if it is important for classification into any of the categories (see the sketch after the reference).
Reference:
• Brank, J., Grobelnik, M., Milic-Frayling, N. & Mladenic, D.: Feature Selection Using Support Vector Machines.
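A minimal sketch of this ranking in Python with scikit-learn, assuming a one-vs-rest setup; ranking by absolute SVM weight and taking the maximum over categories is our reading of the method, not the authors' code.

    import numpy as np
    from sklearn.svm import LinearSVC

    def rank_words(X, Y, vocab):
        """X: (documents x words) term-frequency matrix,
        Y: (documents x categories) 0/1 assignment matrix."""
        importance = np.zeros(X.shape[1])
        for j in range(Y.shape[1]):
            svm = LinearSVC().fit(X, Y[:, j])            # one linear SVM per category
            normal = np.abs(svm.coef_.ravel())           # |n_{j,i}| for every word i
            importance = np.maximum(importance, normal)  # important for *any* category
        order = np.argsort(-importance)
        return [(vocab[i], float(importance[i])) for i in order]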
Word weight learning
• The word weight learning method is based on SVM feature selection.
• Besides ranking the words, it also assigns them weights based on the SVM classifiers.
Notation:
• N – number of documents
• {x_1, …, x_N} – documents
• C(x_k) – set of categories for document x_k
• n – number of words
• {w_1, …, w_n} – word weights
• {n_{j,1}, …, n_{j,n}} – SVM normal vector for the j-th category
Algorithm:
1. Learn a linear SVM classifier for each category.
2. Calculate word weights for each category from the SVM normal vectors. The weight of the i-th word for the j-th category is:

   w_i^j = \frac{1}{N} \sum_{k=1}^{N} x_{k,i} \, n_{j,i}

3. Final word weights are calculated separately for each document; the reweighted i-th component of document x_k is:

   x_{k,i} = \left( \sum_{j \in C(x_k)} w_i^j \right) \cdot TF_i
OntoGen system
• A system for semi-automatic ontology construction.
  – Why semi-automatic? The system only gives suggestions to the user; the user always makes the final decision.
• The system is data-driven and can scale to large collections of documents.
• The current version is focused on the construction of topic ontologies; the next version will be able to deal with more general ontologies.
• Can import/export RDF.
• There is a big divide between unsupervised and fully supervised construction tools.
• Both approaches have weak points:
  – it is difficult to obtain the desired results using unsupervised methods, e.g. because of limited background knowledge;
  – manual tools (e.g. Protégé, OntoStudio) are time-consuming, and the user needs to know the entire domain.
• We combined the two approaches in order to eliminate these weaknesses:
  – the user guides the construction process,
  – the system helps the user with suggestions based on the document collection.
http://kt.ijs.si/blazf/examples/ontogen.html
How does OntoGen help?
By identifying the topics and relations between them:
… using k-means clustering:
• a cluster of documents => a topic
• documents are assigned to clusters => 'subject-of' relation
• we can repeat the clustering on the subset of documents assigned to a specific topic => this identifies subtopics and the 'subtopic-of' relation (see the sketch below)
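A minimal sketch of this step, assuming TF-IDF vectors and scikit-learn's KMeans as stand-ins for OntoGen's internals; the toy corpus is purely illustrative.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "computers in mathematics", "computer science workshop",
        "life sciences research", "physical and life sciences",
        "mathematics and computation", "biology and the life sciences",
    ]
    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # cluster -> topic

    # 'subject-of': document d belongs to topic labels[d]; rerunning the same
    # procedure on one cluster's documents yields the 'subtopic-of' relation.
    subtopic_docs = [d for d, l in zip(docs, labels) if l == 0]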
By naming the topics:
… using the centroid vector:
• The centroid vector of a given topic is the average document of that topic (the normalised sum of the topic's documents).
• The most descriptive keywords for a given topic are the words with the highest weights in the centroid vector.
… using a linear SVM classifier:
• An SVM classifier is trained to separate the documents of the given topic from the other documents in the context.
• The words found most important for this classification are selected as keywords for the topic.
Both naming strategies are sketched below.
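A sketch of the two naming strategies in Python; the normalisation details and the top-k selection are assumptions, not the system's exact recipe.

    import numpy as np
    from sklearn.svm import LinearSVC

    def centroid_keywords(X_topic, vocab, k=3):
        """Keywords from the centroid: normalise each document vector of the
        topic, sum them, and take the k highest-weighted words."""
        normed = X_topic / np.linalg.norm(X_topic, axis=1, keepdims=True)
        centroid = normed.sum(axis=0)
        return [vocab[i] for i in np.argsort(-centroid)[:k]]

    def svm_keywords(X_context, in_topic, vocab, k=3):
        """Keywords from a linear SVM separating the topic's documents
        (in_topic == 1) from the other documents in the context."""
        w = LinearSVC().fit(X_context, in_topic).coef_.ravel()
        return [vocab[i] for i in np.argsort(-w)[:k]]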
[Screenshot: OntoGen's topic ontology visualization, with the selected topic, suggestions of subtopics, topic keywords, the list of all documents with outlier detection, and the documents of the selected topic. Example: a topic ontology of Yahoo! Finances.]
Background knowledge in OntoGen
• All of the methods in OntoGen are based on the bag-of-words representation.
• By using different word weights we can tune these methods to the user's needs.
• The user needs to group the documents into categories; this can be done efficiently using active learning (sketched below).
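One standard recipe for this is uncertainty sampling, sketched below; the margin-based selection rule is a common choice and not necessarily OntoGen's exact procedure.

    import numpy as np
    from sklearn.svm import LinearSVC

    def most_uncertain(X, labeled_idx, y_labeled):
        """Return the unlabeled document closest to the SVM decision boundary,
        i.e. the one whose category membership is least certain."""
        svm = LinearSVC().fit(X[labeled_idx], y_labeled)
        margins = np.abs(svm.decision_function(X))  # distance to the hyperplane
        margins[labeled_idx] = np.inf               # skip already-labeled documents
        return int(np.argmin(margins))              # ask the user about this one

    # Loop: ask the user whether the chosen document belongs to the category,
    # add the answer to the labeled set, retrain, and repeat until stable.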
Influence of background knowledge
• Data: Reuters news articles.
• Each news article is assigned two different sets of tags:
  – Topics
  – Countries
• Each set of tags offers a different view on the data.

[Figure: topic ontologies built from the Topics view and the Countries view of the same document collection.]
Links
• OntoGen: http://ontogen.ijs.si/
• Text Garden: http://www.textmining.net/