Training-less Ontology-based Text Categorization. Maciej Janik December 14

Training-less Ontology-based
Text Categorization.
Maciej Janik
Major professor:
Dr. Krzysztof J. Kochut
Committee
Dr. John A. Miller
Dr. Khaled Rasheed
Dr. Amit P. Sheth
December 14th, 2007
PhD Prospectus presentation
Computer Science Department
University of Georgia
Outline
• Document categorization …
• Classic approach to categorization
• Graph categorization and similarity metrics
• Ontology-based approach to categorization
• Algorithm sketch
• Algorithm details and assumptions
• Example and preliminary results
• Planned work and expected results
• References
Document categorization
Document classification/categorization
is a problem in information science. The
task is to assign an electronic document to
one or more categories, based on its
contents.
[Wikipedia]
Document categorization by people
• People categorize a document by
understanding its content, using their
knowledge, and understanding what the
category is.
• Categorization is based on:
– Document content → features, graph
– Knowledge → ontology
– Category → category definition
– Perceived interest → categorization context
Automatic text categorization
• Automatic text classification can be
defined as the task of assigning category
labels to new documents based on the
knowledge gained by a classification
system at the training stage.
– Requires training with pre-classified documents
• Proposed solution
– Use already defined knowledge for document
categorization and skip the training stage
Classic categorization
• Methods are based on word/phrase statistics, information
gain and other probability or similarity measures.
• Examples [Sebastiani]
– Naïve Bayes, SVM, Decision Tree, k-NN
• Categorization based on information (frequencies,
probabilities) learned from the training documents.
• Vocabulary extension/unification is possible by use of
synonyms, homonyms, word groups (e.g. from WordNet)
• Document representation for categorization
– Set or vector of features - most popular and simple: bag of
words
– Does not include information about document structure,
relative position of phrases, etc.
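A minimal sketch of the bag-of-words representation, assuming a naive lowercase tokenizer (the tokenizer and the name `bag_of_words` are illustrative, not taken from the slides). It shows that two sentences with different structure collapse to the same feature set:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping; word order and position are discarded."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

a = bag_of_words("Ford sells Jaguar")
b = bag_of_words("Jaguar sells Ford")
print(a == b)  # the two sentences are indistinguishable as bags of words
```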
Graph representation of text
• Graph representation preserves (selected)
structural information from the document
– Relative word positions, to find closely co-occurring
phrases.
– Paragraph, formatting (e.g. emphasis), part of
document.
• Sample representations
– Words form a directed graph, chained in the order they
appear in each sentence.
– Words form a weighted graph, where an edge connects
words within a certain distance and its weight determines
closeness.
– Connected terms based on NLP processing or co-occurrence.
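The first two sample representations can be sketched as follows. This is only an illustration: whitespace tokenization, the window size, and the 1/distance weighting are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict

def chain_graph(sentences):
    """Directed graph: an edge (w1, w2) for each pair of adjacent words in a sentence."""
    edges = set()
    for sentence in sentences:
        words = sentence.lower().split()
        edges.update(zip(words, words[1:]))
    return edges

def window_graph(words, max_distance=2):
    """Undirected weighted graph: words at most max_distance apart are linked.
    Closer pairs add more weight (1/distance is an assumed weighting)."""
    weights = defaultdict(float)
    for i, w1 in enumerate(words):
        for d in range(1, max_distance + 1):
            if i + d < len(words) and words[i + d] != w1:
                weights[frozenset((w1, words[i + d]))] += 1.0 / d
    return dict(weights)

print(chain_graph(["Ford sells Jaguar"]))
print(window_graph(["ford", "sells", "jaguar"]))
```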
Graph representations - examples
[Figures: example graph representations of text, after [Schenker] and [Gamon]]
Graph-based categorization
• Categorization based on similarity metrics [Schenker]
– Isomorphism
– Maximum common subgraph / minimum common supergraph
– Graph edit distance
– Statistical methods
• Diameter, degree distribution, betweenness
– Comparison of node neighbors
– Distance preservation measure
• Methods
– k-NN – most straightforward
– Similarity to centroids – graph mean and graph median
– Term distance to category
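As an illustration of one metric from the list, the sketch below computes a maximum-common-subgraph distance and uses it in a 1-NN classifier. It assumes graphs with unique node labels, given as `(nodes, edges)` pairs of sets, so the maximum common subgraph is simply the intersection. Measuring graph size as nodes plus edges is also an assumption.

```python
def graph_size(graph):
    nodes, edges = graph
    return len(nodes) + len(edges)

def mcs_distance(g1, g2):
    """Distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|), in the range [0, 1]."""
    common = (g1[0] & g2[0], g1[1] & g2[1])  # valid only for unique node labels
    largest = max(graph_size(g1), graph_size(g2))
    return 1.0 - graph_size(common) / largest if largest else 0.0

def nearest_category(document, labeled_graphs):
    """1-NN: return the label of the closest pre-labeled graph."""
    return min(labeled_graphs, key=lambda item: mcs_distance(document, item[1]))[0]

doc = ({"ford", "sells", "jaguar"}, {("ford", "sells"), ("sells", "jaguar")})
cars = ({"ford", "sells", "volvo"}, {("ford", "sells"), ("sells", "volvo")})
cats = ({"jaguar", "hunts", "deer"}, {("jaguar", "hunts"), ("hunts", "deer")})
print(nearest_category(doc, [("cars", cars), ("animals", cats)]))
```

Unlike the training-less approach proposed later in the slides, this classifier still needs labeled example graphs.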
Ontology
• “An explicit specification of a
conceptualization.” [Tom Gruber]
• Ontology is a data model that represents a
set of concepts within a domain and the
relationships between those concepts. It is
used to reason about the objects within
that domain. [Wikipedia]
Ontology - example
Use of ontologies in classification
• Term unification
• Hierarchy of concepts
• Entity recognition and disambiguation
• Strengthening co-occurrence of related
entities
• Nearest neighbors
Ontology-based classification
• Ontology IS the knowledge base and
THE CLASSIFIER – no need for training set.
– Rich instance base defines known universe.
– Schema with taxonomy describes the categorization
structure.
• Classification is based on entities recognized
in the text and the semantic relationships between
them.
• Assigned categories are based on entity
types and the taxonomy embedded in the schema.
OntoCategorization – bases
• Probability
– Traditionally, a document is classified based on the
probabilities that a given feature (word, phrase) belongs
to a certain category.
– Here: the more features belong to a category, the more
probable it is that the document belongs to that category.
• Similarity
– A category is defined as an ontology fragment (entities,
classes, structures, etc.)
– Similarity of the document graph to a given ontology fragment
describes closeness to the selected category
• Connectivity (components)
– Knowledge is based on associations.
– Entities in one category should form a connected
component, as they belong to the same subject.
Classes and categories
• Classes do not have to be categories
• Classes
– Form taxonomy / partonomy
– Strict, formal requirements
– Membership based on features
• Categories
– Can include other categories, intersect with them, etc. –
more set-like approach
– Category can be a complex structure of classes,
relationships and instances
– Topic of interest that can span multiple, normally
unrelated classes in schema
Who? What? Where? When? Why?
• WWW – What (who)? Where? When?
– These text dimensions are orthogonal (in most texts).
– Place and date/time are fairly easy to find.
– What / who – description of the article’s topic.
• Ontology classification
– Focus on text core – find ‘what’ and ‘who’ by matching
entities.
– Recognize relationships between entities to construct an
initial document graph.
– Graph overlay from ontology on core entities reveals
semantics from background knowledge of analyzed text.
• Why? Hmm …
OntoCategorization system
Algorithm sketch
• Convert text to thematic graph
– From words to entities (spotting).
– Extract relationships and form triples (NLP).
– Overlay background knowledge.
– Remove unwanted entities (time/place).
• Categorize graph using ontology
– Select thematic component for categorization
(disambiguation and topic set).
– Find best category coverage for selected
thematic graph.
Algorithm sketch – more details
• Match phrases in text with entities in ontology and assign
initial weight.
• Graph overlay – add relationships from ontology between
matched entities.
• Mark / remove entities related to dates and places.
• Add extracted relationships (NLP) between recognized
entities.
• Propagate entity weight in the graph in a way similar to the hubs-authorities algorithm [Kleinberg].
• Find thematic graph(s) for further analysis – connected
component.
• Calculate most important entities based on weight and
graph centrality.
• Find categories in schema that cover largest part of
thematic component, are lowest in hierarchy and include
most important entities.
Experiments
• Wikipedia ontology
– Includes around 2,000,000 entries
• Multiple entity names (variations for matching)
– Has rich instance base (articles)
– Internal href, templates and “infobox” relations
carry semantic connections among entries
– Has large schema with categories – over
310,00 categories
• They DO NOT form a taxonomy, just a graph (it even
includes cycles)
Experiments (2)
• Wikipedia 2 RDF
– Created initially by dbpedia.org
[Auer, Lehmann]
– Creation of RDF – some modifications
• Focus on href, infoboxes and templates
– Special relationships for entities in infoboxes and
templates
• Only English version of Wikipedia
• Entity name variations for matching
– Name, short name (no brackets), redirect,
disambiguation, alternate names
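A hedged sketch of how such name variations might be derived from a Wikipedia-style entity identifier. The function name, the underscore convention, and the `redirects` parameter are assumptions for illustration. Disambiguation pages and alternate names would be supplied the same way as redirects.

```python
import re

def name_variations(entity_id, redirects=()):
    """Return the set of lowercase literals that may match an entity in text."""
    name = entity_id.replace("_", " ")
    # Short name: drop a trailing bracketed qualifier such as "(animal)".
    short = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    return {v.lower() for v in (name, short, *redirects) if v}

print(name_variations("Jaguar_(animal)"))
print(name_variations("Ford_Motor_Company", redirects=["Ford Motor Co."]))
```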
Algorithm details (1)
• Entity name matching
– Entities and relationships are the content of
document – they define topic(s).
– Ontology defines known entities, literals or
phrases assigned to them and classifications.
– Analyzed text must contain some of these
entities to be categorizable – otherwise it is
outside of the ontology scope.
– Matching assigns spotted phrases to known
literals, and later to entities.
• Possible use of stop words and/or stemming.
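The matching step can be sketched as greedy longest-phrase spotting against a dictionary of known literals. The dictionary contents, the `max_len` limit, and the tokenizer are assumptions. The slides do not specify the matching strategy or how match strength is scored.

```python
import re

def spot_entities(text, literals, max_len=4):
    """Greedy longest-match spotting.
    `literals` maps a lowercase phrase to the set of candidate entity ids.
    Returns a list of (phrase, candidates) in text order."""
    tokens = re.findall(r"\w+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in literals:
                found.append((phrase, literals[phrase]))
                i += n
                break
        else:
            i += 1  # no literal starts at this token
    return found

literals = {
    "ford motor co": {"Ford_Motor_Company"},
    "ford": {"Ford_Motor_Company"},
    "jaguar": {"Jaguar_(animal)", "Jaguar_Cars"},
    "land rover": {"Land_Rover"},
}
sentence = "Ford Motor Co. is in the process of selling Jaguar and Land Rover"
for phrase, candidates in spot_entities(sentence, literals):
    print(phrase, sorted(candidates))
```

An ambiguous literal such as "jaguar" keeps all its candidates here. Disambiguation is left to the later graph steps.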
Example of entity matching
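Ford Motor Co. is in the process of selling Jaguar and Land Rover,
according to Ford CEO Alan Mulally.
Candidate entities matched to phrases:
– Ford Motor Co. → Ford Motor Company
– process → Process (computing), Business process, Process (science)
– selling → Sales
– Jaguar → Jaguar (animal), Jaguar Cars Ltd.
– Land Rover → Land_Rover
– Ford → Ford Motor Company
– CEO → Chief Executive Officer
– Alan Mulally → Alan_Mulally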
Algorithm details (2)
• Semantic graph construction
– Add relationships between recognized entities from
ontology, as ontology defines meaningful (semantic)
connections between them.
– Add relationships extracted from NLP analysis of
annotated text.
– Connected entities make graph analysis possible:
connectivity, path finding, etc.
• Date and place elimination
– Dates and places are orthogonal to topic.
– A path connecting entities through a place or date carries
very little meaning for the document topic.
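A minimal sketch of the overlay and elimination steps, assuming the ontology is available as a set of (subject, predicate, object) triples and that entity types are known. The type names `Place` and `Date`, the `located_in` relationship, and the function name are hypothetical.

```python
def overlay_graph(matched, ontology_triples, entity_types,
                  excluded_types=frozenset({"Place", "Date"})):
    """Drop matched entities typed as places or dates, then keep only the
    ontology relationships whose both ends are among the remaining entities."""
    kept = {e for e in matched
            if not (set(entity_types.get(e, ())) & excluded_types)}
    edges = {(s, p, o) for (s, p, o) in ontology_triples
             if s in kept and o in kept}
    return kept, edges

triples = {
    ("Jaguar_Cars", "parent_company", "Ford_Motor_Company"),
    ("Ford_Motor_Company", "located_in", "Detroit"),  # hypothetical relationship
    ("Toyota", "located_in", "Japan"),                # not matched in the text
}
matched = {"Jaguar_Cars", "Ford_Motor_Company", "Detroit"}
nodes, edges = overlay_graph(matched, triples, {"Detroit": {"Place"}})
print(sorted(nodes), sorted(edges))
```

Relationships extracted by NLP could be added to `edges` in the same triple form.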
Example – parse tree and triples
Ford Motor Co. is in the process of selling Jaguar and Land Rover,
according to Ford CEO Alan Mulally.
Example – NLP + ontology knowledge
Ford Motor Co. is in the process of selling Jaguar and Land Rover,
according to Ford CEO Alan Mulally.
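[Figure: document graph combining NLP-extracted triples with ontology relationships.
Nodes: Ford Motor Company, Jaguar Cars, Jaguar (animal), Land Rover, Alan Mulally, Chief Executive Officer.
Edge labels: named_after, parent_company (twice), is_a, has_CEO, CEO_of, sells (twice).]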
Algorithm details (3)
• Weight propagation
– Each entity has its initial weight assigned by
strength of phrase matching.
– As on the web, entities are interconnected and
influence each other.
– We are looking for ‘authority’ entities; the
assumption is that they are most representative of the
topic.
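The slides do not give the exact propagation scheme, so the sketch below is not the hubs-authorities algorithm itself but a simplified, undirected stand-in. In each round an entity keeps its initial match weight and receives a damped, evenly split share of its neighbors' weights. The damping factor, the number of rounds, and the even split are all assumptions.

```python
from collections import defaultdict

def propagate_weights(edges, initial, damping=0.5, rounds=3):
    """Spread weight along undirected edges; every edge endpoint must be in `initial`."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    weights = dict(initial)
    for _ in range(rounds):
        weights = {
            node: initial[node] + damping * sum(
                weights[n] / len(neighbors[n]) for n in neighbors[node])
            for node in initial
        }
    return weights

edges = [("Ford_Motor_Company", "Jaguar_Cars"),
         ("Ford_Motor_Company", "Land_Rover"),
         ("Ford_Motor_Company", "Alan_Mulally")]
initial = {"Ford_Motor_Company": 1.0, "Jaguar_Cars": 1.0,
           "Land_Rover": 1.0, "Alan_Mulally": 1.0, "Jaguar_(animal)": 1.0}
for entity, weight in sorted(propagate_weights(edges, initial).items()):
    print(entity, round(weight, 3))
```

The well-connected entity ends up with the largest weight, while an isolated candidate keeps only its initial weight.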
Algorithm details (4)
• Thematic subgraph in matched graph
– The assumption is that entities associated with the
same or related topics are interconnected in the
ontology, as they are in real life.
– Graph component = topic-related entities.
– Each document (or document fragment)
should be about one or two main topics:
keep only the most important (by weight) and largest
component(s).
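Selecting the thematic component can be sketched as a connected-components search followed by a ranking. Ranking by total entity weight is an assumed rule. The slides say only that the most important and largest component(s) are kept.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets of nodes."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

def thematic_component(weights, edges):
    """Pick the component with the largest total entity weight."""
    return max(connected_components(list(weights), edges),
               key=lambda c: sum(weights[n] for n in c))

weights = {"Ford_Motor_Company": 1.5, "Jaguar_Cars": 1.0, "Land_Rover": 1.0,
           "Jaguar_(animal)": 0.4, "Panthera": 0.3}
edges = [("Ford_Motor_Company", "Jaguar_Cars"),
         ("Ford_Motor_Company", "Land_Rover"),
         ("Jaguar_(animal)", "Panthera")]
print(sorted(thematic_component(weights, edges)))
```

In this toy input the animal sense of "Jaguar" falls into a separate, lighter component and is dropped, which is how component selection doubles as disambiguation.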
Thematic graph examples
[Figure: example thematic graphs. Node labels: Chief Executive Officer, Jaguar Cars,
Jaguar (animal), Ford Motor Company, Alan Mulally, Land Rover, Announcement, Sales,
Business, Buyer, News, Newspaper.]
Algorithm details (5)
• Most important and central entities
– A topic tends to center around a few entities that
are either most important (by weight) or most
central in the graph.
– The classification of the whole subgraph should also be
a subset of the possible classifications of these
entities.
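The slides do not say which centrality measure is used. The sketch below assumes closeness expressed as the sum of shortest-path distances to all other entities in the component, where a lower sum means a more central entity.

```python
from collections import deque

def farness(adjacency, source):
    """Sum of shortest-path distances from source to every reachable node (BFS)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return sum(distance.values())

def most_central(adjacency, top=3):
    """Entities ordered by increasing farness; ties broken by name."""
    return sorted(adjacency, key=lambda n: (farness(adjacency, n), n))[:top]

adjacency = {
    "Alan_Mulally": {"Ford_Motor_Company"},
    "Ford_Motor_Company": {"Alan_Mulally", "Jaguar_Cars"},
    "Jaguar_Cars": {"Ford_Motor_Company"},
}
print(most_central(adjacency, top=1))
```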
Algorithm details (6)
• Categorization
– A category is defined as a set and/or hierarchy of
classes defined in the ontology schema.
– Each entity has a hierarchy of assigned
categories.
– The best ontology class for a graph should:
• Cover the maximum number of entities in the graph.
• Be at a relatively low level in the hierarchy.
• Be close in the hierarchy to the classified entities.
• Include the most important entities (the more, the
better).
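The criteria above can be combined in many ways, and the slides do not give a formula. The sketch below assumes one simple scoring rule: the share of total entity weight a category covers, divided by the average height of the category above the entities it covers. The data structures and function names are hypothetical.

```python
from collections import deque

def category_heights(entity, entity_categories, parents):
    """BFS upward from an entity; height 1 = directly assigned category.
    The visited check keeps the walk finite even if the category graph has cycles."""
    heights = {}
    queue = deque((c, 1) for c in entity_categories.get(entity, ()))
    while queue:
        category, height = queue.popleft()
        if category in heights:
            continue
        heights[category] = height
        queue.extend((p, height + 1) for p in parents.get(category, ()))
    return heights

def rank_categories(weights, entity_categories, parents):
    """Score = covered share of entity weight / average height (assumed formula)."""
    total = sum(weights.values())
    covered, height_sum, count = {}, {}, {}
    for entity, weight in weights.items():
        heights = category_heights(entity, entity_categories, parents)
        for category, height in heights.items():
            covered[category] = covered.get(category, 0.0) + weight
            height_sum[category] = height_sum.get(category, 0) + height
            count[category] = count.get(category, 0) + 1
    scores = {c: (covered[c] / total) / (height_sum[c] / count[c]) for c in covered}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

weights = {"Ford_Motor_Company": 1.0, "Jaguar_Cars": 1.0, "Land_Rover": 1.0}
entity_categories = {"Ford_Motor_Company": ["Car_manufacturers"],
                     "Jaguar_Cars": ["Car_manufacturers"],
                     "Land_Rover": ["Off-road_vehicles"]}
parents = {"Car_manufacturers": ["Automobiles"],
           "Off-road_vehicles": ["Automobiles"],
           "Automobiles": ["Vehicles"],
           "Vehicles": ["Automobiles"]}  # deliberate cycle
for category, score in rank_categories(weights, entity_categories, parents):
    print(category, round(score, 3))
```

Dividing by height favors specific categories over broad ones, while the coverage term keeps a category that fits only one entity from winning.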
Entities and categories
[Figure: entities and their categories in the schema.
Categories: Car Manufacturers, Felines, Living people, Ford, Off-road vehicles,
Pantherinae, Ford people, Jaguar, Panthera, Ford executives.
Entities: Jaguar Cars, Alan Mulally, Jaguar (animal), Ford Motor Company,
Chief Executive Officer, Land Rover.]
Longer example
Ford, utility ready to work on plug-in car Automaker, Southern California Edison to
unveil alliance in response to demand for energy-efficient vehicles.
DETROIT (Reuters) -- Ford Motor Co. and power utility Southern California Edison
will announce an unusual alliance Monday aimed at clearing the way for a new
generation of rechargeable electric cars, the companies said.
Ford (Charts , Fortune 500) Chief Executive Alan Mulally and Edison International
(Charts , Fortune 500) Chief Executive John Bryson are scheduled to meet with
reporters at Edison's headquarters in Rosemead, Calif., the companies said.
[...]
Led by Toyota Motor Corp's (Charts) Prius, the current generation of hybrid vehicles
uses batteries to power the vehicle at low speeds and in to provide assistance
during stop-and-go traffic and hard acceleration, delivering higher fuel economy.
General Motors Corp. (Charts , Fortune 500) has already begun work this year to
develop its own plug-in hybrid car, designed to use little or no gasoline over short
distances. The company showed off a concept version of the Chevrolet Volt in
January at the Detroit Auto show and has awarded contracts to two battery makers
to research advanced batteries for a possible production version.
Longer example graph properties
• Initial number of vertices: 205
• Initial number of edges: 361
• Largest component: 95
• Component for analysis: 35
• Central and most important entities:
– Hybrid_vehicle: * centrality 208, * weight 1.516873
– Automobile: * centrality 213, weight 1.249790
– Internal_combustion_engine: * centrality 233, weight 1.069511
– Ford_Motor_Company: centrality 237, * weight 1.451533
– Southern_California_Edison: centrality 351, * weight 1.308824
Longer example categories
• Category:Automobiles
– CAT instances <13>, (avg. height 2.384615)
weight [0.874697]
• Category:Alternative_propulsion
– CAT instances <4>, (avg. height 1.250000)
weight [0.873287]
• Category:Car_manufacturers
– instances <3> (avg. height 1.000000)
weight [0.781271]
• Category:Vehicles
– CAT instances <13>, (avg. height 2.923077)
weight [0.647903]
• Category:Transportation
– CAT instances <11>, (avg. height 3.090909)
weight [0.629714]
Wikipedia categories
• Wikipedia categories DO NOT form a taxonomy
– It is just a directed graph that contains cycles.
– Subsumption cannot be used for categories.
– Thesaurus-like structure. [Voss]
• Categories may be very deep and detailed, or
very broad
– Hard to pinpoint the cut-off point good for
categorization.
– There is no simple mapping between news categories
and categories in Wikipedia.
Overall performance of initial tests
• Tests against the classic BOW statistical
classifier [McCallum].
• Source articles and categories taken from
CNN – total of 7158 documents in 14
categories.
– Divided into a 50% training / 50% testing split
• Mapping between Wikipedia and CNN
categories done manually by crawling the
generated Wikipedia schema (still not
really precise)
Text corpora – CNN news
CNN and Wikipedia
• CNN categories
– Classified by people
– Describe mostly article interest, not necessarily
its content
• Frequently describe the reader’s interest rather than
the true subject.
– Hard to match to Wikipedia categories
• Wikipedia categories
– Content-based
– Very detailed and deep
Categorization results - BOW
Categorization results – BOW on Wikipedia
Categorization results - Wikipedia
Summary of work
• Ontology storage and querying
– Brahms RDF/S storage
– Sparqler – query language extension with path queries
• For use in Glycomics project
• Prototype of ontology-based categorization
– Partial implementation – not all modules included yet
– Use of general-purpose ontology – RDF graph created
from English Wikipedia
– Initial tests confirm proof of concept
– Published as technical report, submitted to WWW 2008
Remaining research
• Goal
– Create a comprehensive model for ontology-based categorization.
• Create semantic context definition
• Modify and/or create graph similarity
measures that exploit context information
Current work in progress
• Goal
– Create a system in which a user can categorize a
text document with a given ontology using a
specified semantic context.
• NLP module for relationship extraction
• Definition of query context
– Extension of SPARQL with context queries
Proposed work
• Include NLP analysis in creating relationships between
entities
– Will help to link entities that have no connection in the
ontology, or to strengthen an existing connection.
• Explore categorization to a user-defined context (collection
of instances, classes, structures, path expressions).
• Extend definition of category to include context.
• Experiment with other well-developed ontologies to
categorize more specialized documents
– E.g. PubMed
• (optional) Study the applicability of the method for
ontology-based document summarization.
Published papers
• Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High
Performance Memory System for Semantic Association Discovery", Fourth
International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005
• Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic
Association Discovery", Fourth European Semantic Web Conference, ESWC
2007, Innsbruck, Austria, 3-7 June 2007
• Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibanez, Budak
Arpinar, Amit Sheth. "Peer-to-Peer Discovery of Semantic Associations",
Second International Workshop on Peer-to-Peer Knowledge Management,
San Diego, CA, July 17, 2005
• Maciej Janik, Krys Kochut. "Wikipedia in action: Ontological Knowledge in
Text Categorization", UGA Technical Report No. UGA-CS-TR-07-001,
November 2007 – submitted to WWW 2008
• S. Nimmagadda, A. Basu, M. Evenson, J. Han, M. Janik, R. Narra, K.
Nimmagadda, A. Sharma, K.J. Kochut, J.A. Miller and W. S. York,
"GlycoVault: A Bioinformatics Infrastructure for Glycan Pathway
Visualization, Analysis and Modeling," Proceedings of the 5th International
Conference on Information Technology: New Generations (ITNG'08), Las
Vegas, Nevada (April 2008) [to appear]
References
• Auer, S. and Lehmann, J., What have Innsbruck and Leipzig in common?
Extracting Semantics from Wiki Content. in European Semantic Web
Conference (ESWC'07), (Innsbruck, Austria, 2007), Springer, 503-517.
• Gamon, M., Graph-Based Text Representation for Novelty Detection. in
Workshop on TextGraphs at HLT-NAACL 2006, (New York, NY, US, 2006).
• Gruber, T. A Translation Approach to Portable Ontology Specifications.
Knowledge Acquisition, 5 (2). 199-220, 1993.
• Kleinberg, J.M., Authoritative Sources in a Hyperlinked Environment. in
ACM-SIAM Symposium on Discrete Algorithms, (1998).
• McCallum, A.K. Bow: A toolkit for statistical language modeling, text
retrieval, classification and clustering.
http://www.cs.cmu.edu/~mccallum/bow, 1996.
• Nagarajan, M., Sheth, A.P., Aguilera, M., Keeton, K., Merchant, A. and
Uysal, M. Altering Document Term Vectors for Classification - Ontologies
as Expectations of Cooccurrence. LSDIS Technical Report, November, 2006.
• Schenker, A., Bunke, H., Last, M. and Kandel, A. Graph-Theoretic
Techniques for Web Content Mining. World Scientific, London, 2005.
• Sebastiani, F. Machine learning in automated text categorization. ACM
Computing Surveys (CSUR), 34 (1). 1 - 47.
• Voss, J. Collaborative thesaurus tagging the Wikipedia way. ArXiv
Computer Science e-prints, cs/0604036.