Training-less Ontology-based Text Categorization. Maciej Janik

Training-less Ontology-based
Text Categorization.
Maciej Janik
LSDIS lab, Computer Science, University of Georgia
Major professor:
Dr. Krzysztof J. Kochut
Committee
Dr. John A. Miller
Dr. Khaled Rasheed
Dr. Amit P. Sheth
July 1st, 2008
Dissertation Defense
Computer Science Department
University of Georgia
Document categorization
Document classification/categorization
is a problem in information science. The
task is to assign an electronic document to
one or more categories, based on its
contents.
[Wikipedia]
Objectives
• Document categorization method
– Classification is based on knowledge from
ontology
– Does not require a training set
– Use semantic information for categorization
– Explore role of semantic associations in text
categorization
– Incorporate user interest (context) into
categorization
Automatic document categorization
• Methods are based on word/phrase statistics, information
gain and other probability or similarity measures (1).
• Examples
– Naïve Bayes, SVM, Decision Tree, k-NN
• Categorization based on information (frequencies,
probabilities) learned from the training documents.
• Vocabulary extension/unification is possible by use of
synonyms, homonyms, word groups (e.g. from WordNet)
• Document representation for categorization
– Set or vector of features - most popular and simple: bag of
words
– Does not include information about document structure,
relative position of phrases, etc.
(1) Sebastiani, F. Machine learning in automated text categorization. ACM Computing
Surveys (CSUR), 34 (1). 1 - 47.
Document categorization by people
• People categorize a document by
understanding its content, using their
knowledge and understanding what the
category is.
• Categorization is based on:
– Document content: entities and relationships
– Knowledge: ontology
– Category: category definition
– Perceived interest: categorization context
OmniCat approach
• Categorization knowledge
– Ontology
• Features
– Entities, relationships and semantic associations
• Category definitions
– Relevant fragments of ontology
– Importance of classes, entities, and relationships
• Categorization process
– Matching of a document text to find best fit into defined
ontology fragments
Semantic associations
• Semantic Association
– A simple, undirected path that connects two
entities in the knowledge base and describes
how they are related.
– Relationships on the path define meaning of
this connection.
– Directionality of relationships sets specific
interpretation of a path.
– Entities on the path specify the content.
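The definition above can be illustrated with a minimal Python sketch. It assumes a toy list of triples (the entity and property names are made up for illustration) and enumerates simple paths while ignoring edge direction, recording the direction of every traversed relationship so that the path can still be interpreted. It is a naive depth-first search, not the discovery algorithm used in the presented work.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (subject, property, object) triples.
TRIPLES = [
    ("Alan_Mulally", "CEO_of", "Ford"),
    ("Ford", "parent_company_of", "Jaguar_Cars"),
    ("Ford", "parent_company_of", "Land_Rover"),
]

def build_index(triples):
    """Index every triple in both directions so paths can ignore edge direction."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, "->", o))   # traversed along the edge direction
        index[o].append((p, "<-", s))   # traversed against the edge direction
    return index

def semantic_associations(triples, source, target, max_length):
    """Enumerate simple undirected paths of at most max_length relationships."""
    index = build_index(triples)
    results = []

    def extend(node, visited, path):
        if node == target and path:
            results.append(list(path))
            return
        if len(path) == max_length:
            return
        for prop, direction, nxt in index[node]:
            if nxt not in visited:          # simple path: no entity repeats
                visited.add(nxt)
                path.append((node, prop, direction, nxt))
                extend(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    extend(source, {source}, [])
    return results

paths = semantic_associations(TRIPLES, "Jaguar_Cars", "Land_Rover", max_length=3)
```

Here the only association between the two car brands runs against one `parent_company_of` edge and along another, which is why the undirected reading of a path matters.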
(1) Sheth, A. P., I. B. Arpinar, et al. (2003). Relationships at the Heart of Semantic Web:
Modeling, Discovering, and Exploiting Complex Semantic Relationships. Enhancing the
Power of the Internet: Studies in Fuzziness and Soft Computing. M. Nikravesh, B. Azvin, R.
Yager and L. Zadeh, Springer Verlag.
Semantic Associations - Paths in RDF
Directed path
Undirected path
Undirected path,
but with specific
properties and
directionality
BRAHMS
Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High Performance
Memory System for Semantic Association Discovery", Fourth International Semantic Web
Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005
BRAHMS
• Features
– Main-memory RDF/S storage
– Handles RDF and RDFS data
– High performance for accessing RDF/S data
– Efficient handling of large ontologies
– Rich API provides a framework for creating
ontology-based algorithms (e.g. semantic
association discovery)
• Separation of schema and instances
– Read-only access to ontology
• Developed for the needs of the SemDis (1) project
(1) http://lsdis.cs.uga.edu/projects/semdis/
Design decisions
• Performance requirements
– use main memory for storage – fastest access
– create indexes for operations used in graph
traversal algorithms
– use C/C++ in implementation instead of Java
– instead of string URIs, use simple type [int] as
resource identifiers.
• Ontology size
– compact representation for handling large
ontologies – leave some memory for algorithms
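The idea of replacing string URIs with integer identifiers and indexing statements for traversal can be sketched as follows. This is only an illustration in Python: BRAHMS itself is written in C/C++, and the class and function names here (`ResourceTable`, `load`) are hypothetical, not its API.

```python
from array import array

class ResourceTable:
    """Maps string URIs to dense integer identifiers (and back)."""

    def __init__(self):
        self._id_of = {}
        self._uri_of = []

    def intern(self, uri):
        rid = self._id_of.get(uri)
        if rid is None:
            rid = len(self._uri_of)
            self._id_of[uri] = rid
            self._uri_of.append(uri)
        return rid

    def uri(self, rid):
        return self._uri_of[rid]

def load(triples):
    """Store statements as three parallel int arrays plus a per-subject index."""
    table = ResourceTable()
    subjects, properties, objects = array("i"), array("i"), array("i")
    by_subject = {}
    for s, p, o in triples:
        sid, pid, oid = table.intern(s), table.intern(p), table.intern(o)
        by_subject.setdefault(sid, []).append(len(subjects))  # statement number
        subjects.append(sid)
        properties.append(pid)
        objects.append(oid)
    return table, (subjects, properties, objects), by_subject

table, (subjects, properties, objects), by_subject = load([
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:knows", "ex:c"),
])
```

Comparing and hashing small integers is cheaper than comparing URI strings, and the parallel arrays keep each statement at a fixed, compact size.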
Design decisions
• Handle RDF/S
– simplify the design and do not include and
check logic or constraints imposed by OWL
• Separate instance base from schema
– represent instances, schema classes and
properties as different object types
– have specific methods to access schema or
instances
– different types of objects require different
types of statements
Design decisions
• Framework for algorithms
– create rich API of basic operations to access
RDF/S data
• Consequences of design decisions
– compact knowledge base to minimize memory
usage, no memory fragmentation – use
contiguous memory blocks → make it read-only
– create snapshot of memory structures for fast
start-up (parse* once, use many times)
– handle taxonomy in a special way.
(*) Redland’s Raptor is used as RDF/S parser – http://librdf.org/raptor
Results - timing
bi-BFS on synthetic Business-Sports-Entertainment
(45,000 instance statements, 29,889 instances, RDF: 13 MB)
Time [sec] by association length [relations]:
Length  Jena   Sesame  Redland  BRAHMS  Found paths
9       12.8   1.8     0.43     0.1     8,559
10      39.9   11.9    2.6      0.5     131,009
11      59.3   25.7    5.2      1.9     1,680,943
12      847    386     64.8     38      24,392,420
At length 12, BRAHMS is faster than Jena by x 22.29, than Sesame by x 10.16, and than Redland by x 1.70.
Results - timing
bi-BFS search on Univ(700,0) - 6.5 GB file
BRAHMS time [sec] and found paths [log scale] by association length [relations]:
Length  Time [sec]  Found paths
4       0.02        32
5       0.15        205
6       0.33        94,152
7       46.42       1,271,857
8       308.87      314,116,239
SPARQLeR
Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association
Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck, Austria,
3-7 June 2007
SPARQLeR
• Extension of SPARQL for semantic association
discovery.
• Seamlessly integrated into the SPARQL syntax.
• Graph patterns incorporating simple paths with
constraints.
• Support for flexible length paths.
• Property constraints (path patterns) are based
on regular expressions over properties.
• Additional constraints on entities included in the
path (instances and properties).
Path patterns in SPARQLeR
• Path in SPARQLeR is a meta-property
– Resource –[property]→ Resource
– Resource –[path]→ Resource
• Path is also a Sequence
– Test if a resource is in the path:
• rdfs:member
– Test if a resource is at a specific position in the path:
• rdf:_2, rdf:_4, ...
• SPARQLeR-specific path properties
– Test all resources or all properties in the path:
• rdfms:entityResource and rdfms:propertyResource
Example: all resources on a path must be of type foo:Person
SPARQLeR extensions
• Path expressions
– use of regular expressions over properties
• Flexible path specification
– Undirected
– Defined directionality paths
• Directed
– Length restricted
• Complex path patterns
– Test of resources and properties on the path
– Intersecting paths
RegExp in path constraints
• Path constraints on properties are based
on regular expressions
– Uses syntax similar to lex
– Easy for grep users
• Examples:
a
c* d
[abc] c? d
a+
(b|c) a
( b a-1 )+
c
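A regular expression over properties can be emulated with an ordinary regex engine by encoding each step of a path as one character. The sketch below uses Python's `re` module; the one-letter alphabet, the use of uppercase for inverse properties, and the helper names are assumptions made for illustration, not SPARQLeR's implementation.

```python
import re

# Hypothetical property names, each mapped to a single letter.
ALPHABET = {"foo:prop": "p", "foo:rel": "r"}

def encode(path, alphabet):
    """Turn a list of (property, is_inverse) steps into one character per step.

    An inverse traversal (written -foo:rel in a path pattern) is encoded
    as the uppercase letter of the property.
    """
    return "".join(
        alphabet[prop].upper() if is_inverse else alphabet[prop]
        for prop, is_inverse in path
    )

# The pattern "(foo:prop -foo:rel)+" becomes "(pR)+" in this encoding.
PATTERN = re.compile(r"(pR)+")

def satisfies(path):
    """True if the whole property sequence of the path matches the pattern."""
    return PATTERN.fullmatch(encode(path, ALPHABET)) is not None

alternating = [("foo:prop", False), ("foo:rel", True),
               ("foo:prop", False), ("foo:rel", True)]
forward_only = [("foo:prop", False), ("foo:rel", False)]
```

`satisfies(alternating)` holds, while `satisfies(forward_only)` does not, because its second step follows `foo:rel` in the forward direction.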
SPARQLeR - example
SELECT list(%path) WHERE
{ <r> %path <s> .
  %path rdf:_2 <e> .
  %path rdfms:entityResource ?x .
  ?x rdf:type <foo:A>
  FILTER( length(%path)<=6 &&
          regex(%path, "(foo:prop -foo:rel)+", "dih") ) }
[Figure: example graph with nodes r, e and s connected by foo:prop and foo:rel edges; resources ?x on the path have rdf:type A; one edge is labelled rdfs:subPropertyOf]
Experiments
• Scalability
– Modified DBLP datasets in RDF (added random citations)
– Test on increasing dataset (adding older years of
publications)
– Search for cited publications (transitive)
PREFIX opus: <http://lsdis.cs.uga.edu/projects/semdis/opus#>
SELECT ?end_publication WHERE {
<http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06>
%path ?end_publication
FILTER ( length(%path)<=26 &&
regex(%path, "(opus:cites_publication)*" ) ) }
B. Aleman-Meza et al. Semantic Analytics on Social Networks:
Experiences in Addressing the Problem of Conflict of Interest Detection. (WWW2006)
Experiments – dataset characteristics
Experiments – results: single source paths
Search paths up to length 26
OmniCat
Maciej Janik, Krys Kochut. “OmniCat: Automatic Text Classification with Dynamically
Defined Categories”, 7th International Semantic Web Conference (ISWC 2008), Karlsruhe,
Germany [submitted to]
Maciej Janik, Krys Kochut. "Wikipedia in Action: Ontological Knowledge in Text
Categorization", Second IEEE International Conference on Semantic Computing, ICSC
2008, Santa Clara, CA, USA, August 2008 [to appear]
Maciej Janik, Krys Kochut. "Training-less Ontology-based Text Categorization",
Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2008) at
the 30th European Conference on Information Retrieval (ECIR'08), Glasgow, Scotland, 30
March 2008
Ontology
• “An explicit specification of a
conceptualization.” (1)
• Ontology is a data model that represents a
set of concepts within a domain and the
relationships between those concepts. It is
used to reason about the objects within
that domain. [Wikipedia]
(1) Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge
Acquisition, 5 (2). 199-220, 1993.
Ontology-based classification
• Ontology IS the knowledge base and
THE CLASSIFIER – no need for training set.
– Rich instance base defines known universe.
– Schema with taxonomy describes the categorization
structure.
• Classification is based on recognized entities
in text and semantic relationships between
them.
• Categories assigned are based on entity
types, the taxonomy embedded in the schema and
the provided categorization contexts.
OntoCategorization – bases
• Probability
– Document is classified based on probabilities that given
feature (word, phrase) belongs to a certain category.
• Similarity
– Category is defined as ontology fragment (entities,
classes, structures, etc.)
– Similarity of document graph to given ontology fragment
describes closeness to selected category
• Connectivity (components)
– Knowledge is based on associations.
– Entities in one category should form a connected
component, as they belong to the same subject.
• Context
– Specific entities, entity types, or semantic structures
may be of different importance for user
Graph representation of text
• Graph representation preserves (selected)
structural information from document
– Relative word positions to find closely co-occurring
phrases.
– Paragraph, formatting (e.g. emphasis), part of
document.
• Sample representations
– Words form a directed graph, chained in the order they
appear in each sentence.
– Words form a weighted graph, where an edge connects
words within a certain distance and the weight determines
closeness.
– Connected terms based on NLP processing or co-occurrence.
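The weighted-graph representation can be sketched in a few lines of Python. The window size, the 1/distance weighting and the simple regex tokenizer are assumptions chosen for illustration; the slides do not prescribe them.

```python
import re
from collections import defaultdict

def word_graph(text, window=3):
    """Connect words at most `window` positions apart; closer pairs weigh more."""
    weights = defaultdict(float)
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", sentence)
        for i, left in enumerate(words):
            for distance in range(1, window + 1):
                if i + distance >= len(words):
                    break
                right = words[i + distance]
                if left != right:
                    edge = tuple(sorted((left, right)))   # undirected edge
                    weights[edge] += 1.0 / distance       # assumed weighting
    return dict(weights)

graph = word_graph("Ford sells Jaguar. Ford sells Land Rover.", window=2)
```

Edges never cross sentence boundaries here, so "jaguar" and "land" stay unconnected even though they are adjacent in the raw text.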
Graph-based categorization
• Categorization based on similarity metrics (1)
– Isomorphism
– Maximum common subgraph/ minimum common
supergraph
– Graph edit distance
– Statistical methods
• Diameter, degree distribution, betweenness
– Comparison of node neighbors
– Distance preservation measure
• Methods
– k-NN – most straightforward
– similarity to centroids – graph mean and graph median
– term distance to category
(1) Schenker, A., Bunke, H., Last, M. and Kandel, A. Graph-Theoretic Techniques for Web
Content Mining. World Scientific, London, 2005.
Classes and categories
• Classes do not have to be categories
• Classes
– Form taxonomy / partonomy
– Strict, formal requirements
– Membership based on features
• Categories
– Can include other categories, intersect with them, etc. –
more set-like approach
– Category can be a complex structure of classes,
relationships and instances
– Topic of interest that can span multiple, normally
unrelated classes in schema
OmniCat system
Algorithm sketch
• Semantic graph construction
– Conversion of an unstructured text into
semantic graph
• Thematic graph selection
– Setting a topic by selection of graph(s) for
categorization
• Categorization using ontology
– Bottom-up approach of category discovery
– Top-down approach with categorization context
projection
Semantic graph construction (1)
• Named entity identification
– Match known phrases (literals) from the ontology and
assign an initial confidence weight
– Each phrase is assigned a confidence level based on
uniqueness of entity identification
– Number of times each phrase is matched suggests its
importance in text
– Text-phrase similarity is used when applying stop word
removal or stemming
$w = 1 - \frac{1}{1 + \sum_{i=1}^{n} p_i \cdot s(l_i, mp_i)}$
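Reading the weight formula on this slide as w = 1 - 1 / (1 + Σ p_i · s(l_i, mp_i)), the initial entity weight can be computed as in the sketch below. The reading of p_i as the phrase confidence and s as the text-phrase similarity follows the bullets above; the `phrase_confidence` helper (confidence 1/k for a phrase that names k entities) is purely an assumption for illustration.

```python
def phrase_confidence(candidate_count):
    """Assumed uniqueness measure: a phrase naming k entities gets confidence 1/k."""
    return 1.0 / candidate_count

def entity_weight(matches):
    """Initial weight of an entity from its phrase matches.

    matches: list of (p_i, s_i) pairs, one per matched occurrence, where p_i
    is the phrase confidence and s_i the similarity between the ontology
    literal and the matched text.
    """
    total = sum(p * s for p, s in matches)
    return 1.0 - 1.0 / (1.0 + total)

# "Ford Motor Co." matched once, unambiguously and exactly.
ford = entity_weight([(phrase_confidence(1), 1.0)])
# "Jaguar" matched once, but the phrase names two entities.
jaguar = entity_weight([(phrase_confidence(2), 1.0)])
```

Under this reading the weight stays in [0, 1) and grows with every additional match, which fits the statement that the number of matches suggests importance.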
Example of entity matching
“Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.”
Phrases matched to ontology entities:
– Ford Motor Co.: Ford Motor Company
– process: Process (computing), Business process, Process (science)
– selling: Sales
– Jaguar: Jaguar (animal), Jaguar Cars Ltd.
– Land Rover: Land_Rover
– Ford: Ford Motor Company
– CEO: Chief Executive Officer
– Alan Mulally: Alan_Mulally
Semantic graph construction (2)
• Entity relationship extraction
– NLP parse of each sentence to get dependency
tree
– Use previously matched phrases as clues for
entity positions
– If matched phrases are close in the parse tree,
add a relationship between them in the final
graph
• OmniCat does not extract named
relationships
Example – parse tree and triples
Ford Motor Co. is in the process of selling Jaguar and Land Rover,
according to Ford CEO Alan Mulally.
Semantic graph construction (3)
• Connectivity
inducement
– For each pair of
matched entities find all
relationships in the
ontology
– Each relationship has
importance factor,
based on semantics of
information it defines
Example – NLP + ontology knowledge
Ford Motor Co. is in the process of selling Jaguar and Land Rover,
according to Ford CEO Alan Mulally.
[Figure: semantic graph for the sentence, with the entities Jaguar (animal), Jaguar Cars, Land Rover, Ford Motor Company, Alan Mulally and Chief Executive Officer connected by the relationships named_after, parent_company, is_a, has_CEO, CEO_of and sells]
Thematic graph selection (1)
• Removal of specific types of entities
(optional)
– Specific for news documents
– What? Who?
• Content of the news
– Where? When?
• Date, time and place
• Entities that may become hotspots in the created
document graph
Thematic graph selection (2)
• Entity weight propagation
– Each entity is assigned an initial match weight
– Entities are connected by relationships with
a given importance factor
– Propagate weight using the HITS (1) algorithm to
find the best hub and authority entities
– Best authoritative entities are most important
for document categorization – core of the
graph
– Calculate centrality to find entities that are
“topic landmarks”:
$\mathrm{Centrality}(v_i) = \frac{1}{\sum_j d(v_i, v_j)}$
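A minimal sketch of the two measures named above, under stated assumptions: HITS is run as plain power iteration with relationship importance as edge weight and the match weights as the starting scores, and centrality is read as the reciprocal of the summed shortest-path distances over the undirected graph. How OmniCat combines the scores is not shown on the slide, so none of this should be taken as its exact procedure.

```python
from collections import deque

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
    return {n: x / norm for n, x in scores.items()}

def hits(edges, initial, iterations=50):
    """edges: {(source, target): importance}; initial: {entity: match weight}."""
    hub = dict(initial)
    auth = dict(initial)
    for _ in range(iterations):
        # Authority of v: importance-weighted sum of hub scores pointing at v.
        auth = {n: 0.0 for n in initial}
        for (u, v), importance in edges.items():
            auth[v] += importance * hub[u]
        auth = normalize(auth)
        # Hub of u: importance-weighted sum of authority scores u points at.
        hub = {n: 0.0 for n in initial}
        for (u, v), importance in edges.items():
            hub[u] += importance * auth[v]
        hub = normalize(hub)
    return hub, auth

def undirected(edges):
    """Neighbor sets that ignore relationship direction."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors

def centrality(neighbors, v):
    """1 / sum of shortest-path distances from v to every reachable entity."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return 1.0 / total if total else 0.0

EDGES = {
    ("Alan_Mulally", "Ford_Motor_Company"): 1.0,
    ("Jaguar_Cars", "Ford_Motor_Company"): 1.0,
    ("Land_Rover", "Ford_Motor_Company"): 1.0,
}
WEIGHTS = {"Alan_Mulally": 1.0, "Jaguar_Cars": 1.0,
           "Land_Rover": 1.0, "Ford_Motor_Company": 1.0}
hub, auth = hits(EDGES, WEIGHTS)
neighbors = undirected(EDGES)
```

In this toy graph Ford_Motor_Company is both the strongest authority and the most central entity, so it would form the core of the thematic graph.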
(1) Kleinberg, J.M., Authoritative Sources in a Hyperlinked Environment. in
ACM-SIAM Symposium on Discrete Algorithms, (1998).
Thematic graph selection (3)
• Selection of the dominant thematic graph
for categorization
– Select connected component that is largest
and has maximum weight for further
categorization
– Based on assumption that entities associated
with the same or related topics are
interconnected in ontology
– Effectively disambiguates many incorrectly
matched entities
– Focuses on one or a few major topics of a
document
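Selecting the dominant thematic graph amounts to finding connected components and ranking them. The sketch below ranks by number of entities first and total entity weight second; that particular combination of "largest" and "maximum weight" is an assumption, and the graph and weights are hypothetical.

```python
def connected_components(neighbors):
    """Split an undirected graph {node: set(neighbors)} into components."""
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

def dominant_component(neighbors, weight):
    """Pick the component with most entities, breaking ties by total weight."""
    return max(connected_components(neighbors),
               key=lambda c: (len(c), sum(weight[e] for e in c)))

NEIGHBORS = {
    "Ford_Motor_Company": {"Jaguar_Cars", "Land_Rover", "Alan_Mulally"},
    "Jaguar_Cars": {"Ford_Motor_Company"},
    "Land_Rover": {"Ford_Motor_Company"},
    "Alan_Mulally": {"Ford_Motor_Company"},
    "Jaguar_(animal)": {"Panthera"},
    "Panthera": {"Jaguar_(animal)"},
}
WEIGHT = {entity: 1.0 for entity in NEIGHBORS}
theme = dominant_component(NEIGHBORS, WEIGHT)
```

The wrongly matched `Jaguar_(animal)` falls into a small separate component and is dropped, which is the disambiguation effect described above.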
Thematic graph examples
[Figure: example thematic graphs over the entities Chief Executive Officer, Jaguar Cars, Jaguar (animal), Ford Motor Company, Alan Mulally, Land Rover, Announcement, Sales, Business, Buyer, News and Newspaper]
Thematic graph categorization
• Categorization concentrates on selected
dominant thematic graph
• Proposed methods
– Bottom-up category discovery
• Class-category mapping
– Top-down category projection
• Categorization based on context projection
• Combination of categorization contexts for complex
categories
Bottom-up categorization (1)
• Category discovery approach
– No category definitions are needed, only
taxonomy from the ontology
– Bottom-up approach – discover categories
based on classification of entities
– The best category should
• Cover the largest portion of entities in the thematic graph
• Be the most direct class possible for the entities
• Include entities from the core of the graph
[Formula: score s_Ci(h_max) of class C_i, computed from the weights w_j of covered entities e_j and the weights w_k of core entities e_Ck, each related to the hierarchical distance h(C_i, e) between the class and the entity]
Bottom-up class discovery
Bottom-up categorization (2)
• External categories are given as sets of
classes
– In case of Wikipedia and external corpora,
categories are defined as a mapping of
appropriate Wikipedia categories
• Previously discovered categories are
matched with category definitions
– Top-k are considered for matching
– Matching until one category becomes dominant
Entities and categories
[Figure: matched entities (Jaguar Cars, Alan Mulally, Jaguar (animal), Ford Motor Company, Chief Executive Officer, Land Rover) and the Wikipedia categories above them (Car Manufacturers, Felines, Living people, Ford, Off-road vehicles, Pantherinae, Ford people, Jaguar, Panthera, Ford executives)]
Example
Ford, utility ready to work on plug-in car Automaker, Southern California Edison to
unveil alliance in response to demand for energy-efficient vehicles.
DETROIT (Reuters) -- Ford Motor Co. and power utility Southern California Edison
will announce an unusual alliance Monday aimed at clearing the way for a new
generation of rechargeable electric cars, the companies said.
Ford (Charts , Fortune 500) Chief Executive Alan Mulally and Edison International
(Charts , Fortune 500) Chief Executive John Bryson are scheduled to meet with
reporters at Edison's headquarters in Rosemead, Calif., the companies said.
[...]
Led by Toyota Motor Corp's (Charts) Prius, the current generation of hybrid vehicles
uses batteries to power the vehicle at low speeds and to provide assistance
during stop-and-go traffic and hard acceleration, delivering higher fuel economy.
General Motors Corp. (Charts , Fortune 500) has already begun work this year to
develop its own plug-in hybrid car, designed to use little or no gasoline over short
distances. The company showed off a concept version of the Chevrolet Volt in
January at the Detroit Auto show and has awarded contracts to two battery makers
to research advanced batteries for a possible production version.
Example: graph properties
• Initial number of vertices: 205
• Initial number of edges: 361
• Largest component: 95
• Component for analysis: 35
• Central and most important entities:
– Hybrid_vehicle
* Centrality 208, * weight 1.516873
– Automobile
* Centrality 213, weight 1.249790
– Internal_combustion_engine
* Centrality 233, weight 1.069511
– Ford_Motor_Company
Centrality 237, * weight 1.451533
– Southern_California_Edison
Centrality 351, * weight 1.308824
Example: assigned categories
• Category:Automobiles
– CAT instances <13>, (avg. height 2.384615)
weight [0.874697]
• Category:Alternative_propulsion
– CAT instances <4>, (avg. height 1.250000)
weight [0.873287]
• Category:Car_manufacturers
– instances <3> (avg. height 1.000000)
weight [0.781271]
• Category:Vehicles
– CAT instances <13>, (avg. height 2.923077)
weight [0.647903]
• Category:Transportation
– CAT instances <11>, (avg. height 3.090909)
weight [0.629714]
Top-down approach
• Need externally defined categories
– Categories are given as classification contexts
– Category can be defined as combination of
contexts
• Categorization process
– Each context is projected onto the thematic
graph
– Fitness score for each context is calculated
– When a category is defined as a linear
combination of contexts, cosine similarity is
calculated for the fitness score
Categorization context
• Simplify definition of categories by classes
and projection.
• Better capture user interest in categories
by specifying the preferred type of entities.
• Define union, intersection, and difference
of contexts for flexible context definition.
• Enable creating combination of contexts
for defining more complex categories.
Hierarchical distance and projection
• Distance between entity and
class – number of rdf:type and
rdfs:subClassOf properties
• Distance between entity and
set of classes – minimum
distance to all classes in the
set
• Entity is not covered by a class
(or any class in the set) –
distance is zero
• Projection of context on
instance base – instances with
assigned hierarchical distance
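The hierarchical distance can be sketched as a breadth-first walk up the schema. The dictionaries `types` (rdf:type links) and `superclasses` (rdfs:subClassOf links) and their contents are hypothetical; the return value 0 for an uncovered entity follows the convention stated above.

```python
from collections import deque

def hierarchical_distance(entity, target_classes, types, superclasses):
    """Fewest rdf:type / rdfs:subClassOf steps from entity to any target class.

    Returns 0 when no target class covers the entity, as stated on the slide.
    """
    targets = set(target_classes)
    direct = types.get(entity, ())
    queue = deque((cls, 1) for cls in direct)     # the rdf:type step counts as 1
    seen = set(direct)                            # guards against schema cycles
    while queue:
        cls, dist = queue.popleft()
        if cls in targets:
            return dist                           # BFS: first hit is the minimum
        for parent in superclasses.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return 0

TYPES = {"Ford_Motor_Company": ["Car_manufacturers"]}
SUPERCLASSES = {
    "Car_manufacturers": ["Automobiles", "Companies"],
    "Automobiles": ["Vehicles"],
    "Companies": ["Business"],
}

def dist_to(classes):
    return hierarchical_distance("Ford_Motor_Company", classes, TYPES, SUPERCLASSES)
```

Because the search is breadth-first, passing a set of classes automatically yields the minimum distance over the set, and the `seen` set keeps the walk finite on category graphs that contain cycles.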
Categorization into contexts
• Fitness score for context
$fs(C, T) = \sum_k w_k \cdot h(dist_H(e_k, C)) + \sum_n w_{cn} \cdot h_c(dist_H(e_{cn}, C))$
• Hierarchical distance
weighting function
$h(dist_H(e, C)) = N(1, 2)(dist_H(e, C))$
to emphasize the weight of the
nearest classes
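The fitness score can be sketched as a weighted sum over entities and core entities. Several points are assumptions here: N(1, 2) is read as a normal density with mean 1 and standard deviation 2 (it could equally denote variance 2), uncovered entities (distance 0) are given zero weight, and the core-entity function h_c defaults to the same function as h because the slide does not define it.

```python
import math

def normal_pdf(x, mean=1.0, std=2.0):
    """Density of a normal distribution; mean 1 and std 2 are assumed readings."""
    return math.exp(-((x - mean) ** 2) / (2 * std * std)) / (std * math.sqrt(2 * math.pi))

def dist_weight(distance):
    """h: weight of a hierarchical distance; distance 0 means 'not covered'."""
    return normal_pdf(distance) if distance > 0 else 0.0

def fitness(entities, core_entities, h=dist_weight, h_c=dist_weight):
    """fs(C, T) for one context.

    entities, core_entities: lists of (weight, hierarchical distance) pairs
    for the thematic graph's entities and its core entities respectively.
    """
    return (sum(w * h(d) for w, d in entities)
            + sum(w * h_c(d) for w, d in core_entities))

score = fitness(entities=[(1.0, 1), (0.5, 3), (0.8, 0)],
                core_entities=[(1.5, 1)])
```

With the density centred at distance 1, entities that are direct instances of a context class contribute most, and the contribution falls off as the class gets more distant.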
Categorization context example
Business
Person
( Business  Person )  Business
Complex categories - composition of
contexts
bs
bs
b combined with s
Linear combination
of contexts
Top-down categorization
• For each defined categorization context
calculate a fitness score using context
projection onto instance base
– If there are only “simple” contexts, fitness
scores can be compared directly to choose
category
– Otherwise, create a vector space from the
calculated fitness scores and calculate
similarity (cosine) between category definition
and context vector
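For categories defined as linear combinations of contexts, the comparison can be sketched with a plain cosine similarity. The context names, fitness scores and category coefficients below are made-up values for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def categorize(fitness_scores, categories):
    """Pick the category whose context combination best matches the document.

    fitness_scores: {context: fitness score of the document's thematic graph}
    categories: {category name: {context: coefficient in the linear combination}}
    """
    contexts = sorted(fitness_scores)             # fixed axis order
    doc = [fitness_scores[c] for c in contexts]
    scored = {
        name: cosine(doc, [coeffs.get(c, 0.0) for c in contexts])
        for name, coeffs in categories.items()
    }
    return max(scored, key=scored.get), scored

FITNESS = {"Business": 0.8, "Person": 0.6, "Sports": 0.1}
CATEGORIES = {
    "Business people": {"Business": 1.0, "Person": 1.0},
    "Athletes": {"Sports": 1.0, "Person": 1.0},
}
best, scored = categorize(FITNESS, CATEGORIES)
```

Cosine similarity compares the direction of the vectors rather than their length, so a long document with uniformly higher fitness scores is not favoured over a short one with the same mix of contexts.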
Top-down classification
Experiments (1)
• Classic text categorization algorithms
– BOW statistical classifier (1)
– SVM implemented in Weka (2)
• Text corpora
– CNN (2007-07-03 – 2007-09-04)
• 2,590 news documents in 12 categories
– Reuters RCV1 (1996-08-20 – 1996-09-02)
• 2,254 documents in 6 categories
• Mapping for Wikipedia categories
– Created manually by mapping top Wikipedia categories
with corpora categories
(1) McCallum, A.K. Bow: A toolkit for statistical language modeling, text retrieval, classification
and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
(2) Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques (2nd
ed.). Morgan Kaufmann, San Francisco (2005)
Experiments (2)
• Wikipedia ontology
– Includes around 2,000,000 entries
• Multiple entity names (variations for matching)
– Has rich instance base (articles)
– Internal href, templates and “infobox” relations
carry semantic connections among entries
– Has large schema with categories – over
310,00 categories
• They DO NOT form a taxonomy, just a graph (which even
includes cycles)
Experiments (3)
• Wikipedia 2 RDF
– Created initially by dbpedia.org (1)
– Creation of RDF – some modifications
• Focus on href, infoboxes and templates
– Special relationships for entities in infoboxes and
templates
– Only English version of Wikipedia
• Entity name variations for matching
– Name, short name (no brackets), redirect,
disambiguation, alternate names
(1) Auer, S. and Lehmann, J., What have Innsbruck and Leipzig in common? Extracting
Semantics from Wiki Content. in European Semantic Web Conference (ESWC'07), (Innsbruck,
Austria, 2007), Springer, 503-517.
Wikipedia categories
• Wikipedia categories DO NOT form a taxonomy
– It is just a directed graph, that contains cycles.
– Not possible to use subsumption for categories.
– Thesaurus-like structure (1).
• Categories may be very deep and detailed, or
very broad
– Hard to pinpoint the cut-off point good for
categorization.
– There is no simple mapping between news categories
and categories in Wikipedia.
(1) Voss, J. Collaborative thesaurus tagging the Wikipedia way. ArXiv Computer Science e-prints, cs/0604036.
Text corpora information
Text corpora – CNN mapping
Text corpora – Reuters mapping
Bottom-up categorization - OmniCat
OmniCat results using Wikipedia-CNN category mapping
Bottom-up categorization – BOW
BOW results on CNN corpora using Wikipedia training
Bottom-up categorization – BOW (2)
BOW results on Wikipedia corpora using Wikipedia training
Bottom-up categorization - Reuters
Comparison of BOW, SVM and OmniCat (bottom-up approach)
on selected Reuters corpora
Top-down categorization - OmniCat
OmniCat results on CNN corpora using top-down approach
with categorization context projection
OmniCat categorization – CNN
Comparison of CNN corpora categorization results of BOW, SVM,
OmniCat bottom-up (Onto), and OmniCat top-down (OmniCat)
OmniCat categorization – Reuters
Comparison of Reuters corpora categorization results of BOW, SVM,
OmniCat bottom-up (Onto), and OmniCat top-down (OmniCat)
Misclassifications - text corpora and
Wikipedia
• Original text corpora categories
– Classified by people
– Describe mostly article interest, not necessarily
its content
• Frequently describe the reader’s interest rather than
the true subject.
– Hard to match to Wikipedia categories
• Wikipedia categories
– Content-based
– Very detailed and deep
– Some regions in ontology are better developed
Summary of work
• Ontology storage and querying
– BRAHMS RDF/S storage
– SPARQLeR – query language extension with path queries
• For use in Glycomics project
• OmniCat - Ontology-based categorization
– Methodology for ontology-based categorization
– Proposed two schemes of categorization
– Defined categorization context, combination of contexts
for categorization
– Implemented OmniCat prototype
– Experiments using general-purpose ontology – RDF/S
graph created from the English Wikipedia
– Published at ESAIR’08 and ICSC’08, submitted to
ISWC’08
Proposed work
• Experiment with other ontologies and taxonomies for
categorization
– Use categories extracted from Freebase or Dmoz
– Categorize medical publications to MeSH using Wikipedia
references
• Approach to categorization
– Include definitions of interesting structures (e.g. specific
semantic associations) into categorization context
– Utilize context information in calculating and selecting the
document core entities
– Use other similarity metrics for calculating thematic graph and
ontology similarity
• OmniCat beyond text categorization
– Study applicability of OmniCat approach for categorizing
ontologies with other (gold standard) ontologies
– Document summarization using semantic graph (towards
proposition presented in [1])
(1) Leskovec, J., M. Grobelnik, et al. (2004). Learning Semantic Graph Mapping for Document
Summarization. 8th European Conference on Principles and Practice of Knowledge Discovery in
Databases (PKDD), Pisa, Italy.
Published papers
• Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High
Performance Memory System for Semantic Association Discovery", Fourth
International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10
November 2005
• Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association
Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck,
Austria, 3-7 June 2007
• Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibanez, Budak Arpinar,
Amit Sheth. "Peer-to-Peer Discovery of Semantic Associations", Second International
Workshop on Peer-to-Peer Knowledge Management, San Diego, CA, July 17, 2005
• Maciej Janik, Krys Kochut. "Wikipedia in Action: Ontological Knowledge in Text
Categorization", Second IEEE International Conference on Semantic Computing, ICSC
2008, Santa Clara, CA, USA, August 2008 [to appear]
• Maciej Janik, Krys Kochut. "Training-less Ontology-based Text Categorization",
Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2008)
at the 30th European Conference on Information Retrieval (ECIR'08), Glasgow,
Scotland, 30 March 2008
• Matthew Eavenson, Maciej Janik, Shravya Nimmagadda, John A. Miller, Krys J.
Kochut, William S. York. "GlycoBrowser - A Tool for Contextual Visualization of
Biological Data and Pathways Using Ontologies", 4th International Symposium on
Bioinformatics Research and Applications (ISBRA2008), Atlanta, Georgia (May 2008)
• S. Nimmagadda, A. Basu, M. Evenson, J. Han, M. Janik, R. Narra, K. Nimmagadda,
A. Sharma, K.J. Kochut, J.A. Miller and W. S. York, "GlycoVault: A Bioinformatics
Infrastructure for Glycan Pathway Visualization, Analysis and Modeling," Proceedings
of the 5th International Conference on Information Technology: New Generations
(ITNG'08), Las Vegas, Nevada (April 2008)