Training-less Ontology-based Text Categorization
Maciej Janik
Major professor: Dr. Krzysztof J. Kochut
Committee: Dr. John A. Miller, Dr. Khaled Rasheed, Dr. Amit P. Sheth
December 14th, 2007
PhD Prospectus presentation
Computer Science Department, University of Georgia

Outline
• Document categorization …
• Classic approach to categorization
• Graph categorization and similarity metrics
• Ontology-based approach to categorization
• Algorithm sketch
• Algorithm details and assumptions
• Example and preliminary results
• Planned work and expected results
• References

Document categorization
Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. [Wikipedia]

Document categorization by people
• People categorize a document by understanding its content, using their knowledge, and understanding what the category is.
• Categorization is based on:
  – Document content → features, graph
  – Knowledge → ontology
  – Category → category definition
  – Perceived interest → categorization context

Automatic text categorization
• Automatic text classification can be defined as the task of assigning category labels to new documents based on the knowledge gained in a classification system at the training stage.
  – Requires training with pre-classified documents.
• Proposed solution – use already defined knowledge for document categorization and skip the training stage.

Classic categorization
• Methods are based on word/phrase statistics, information gain and other probability or similarity measures.
• Examples [Sebastiani]
  – Naïve Bayes, SVM, Decision Tree, k-NN
• Categorization is based on information (frequencies, probabilities) learned from the training documents.
• Vocabulary extension/unification is possible by use of synonyms, homonyms, word groups (e.g., from WordNet).
• Document representation for categorization
  – Set or vector of features; the most popular and simple is the bag of words.
  – Does not include information about document structure, relative position of phrases, etc.

Graph representation of text
• A graph representation preserves (selected) structural information from the document
  – Relative word positions, to find closely co-occurring phrases.
  – Paragraph, formatting (e.g., emphasis), part of document.
• Sample representations
  – Words form a directed graph, chained in the order in which they appear in each sentence.
  – Words form a weighted graph, where an edge connects words within a certain distance and the weight expresses closeness.
  – Terms are connected based on NLP processing or co-occurrence.

Graph representations - examples
[Schenker] [Gamon]

Graph-based categorization
• Categorization based on similarity metrics [Schenker]
  – Isomorphism
  – Maximum common subgraph / minimum common supergraph
  – Graph edit distance
  – Statistical methods
    • Diameter, degree distribution, betweenness
  – Comparison of node neighbors
  – Distance preservation measure
• Methods
  – k-NN – most straightforward
  – Similarity to centroids – graph mean and graph median
  – Term distance to category

Ontology
• “An explicit specification of a conceptualization.” [Tom Gruber]
• An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
[Wikipedia]

Ontology - example
[figure]

Use of ontologies in classification
• Term unification
• Hierarchy of concepts
• Entity recognition and disambiguation
• Strengthening co-occurrence of related entities
  • Nearest neighbors

Ontology-based classification
• The ontology IS the knowledge base and THE CLASSIFIER – no need for a training set.
  – A rich instance base defines the known universe.
  – A schema with a taxonomy describes the categorization structure.
• Classification is based on the entities recognized in the text and the semantic relationships between them.
• Assigned categories are based on entity types and the taxonomy embedded in the schema.

OntoCategorization – bases
• Probability
  – Traditionally, a document is classified based on the probabilities that a given feature (word, phrase) belongs to a certain category.
  – Here: the more features belong to a category, the more probable it is that the document belongs to that category.
• Similarity
  – A category is defined as an ontology fragment (entities, classes, structures, etc.).
  – The similarity of the document graph to a given ontology fragment describes its closeness to the selected category.
• Connectivity (components)
  – Knowledge is based on associations.
  – Entities in one category should form a connected component, as they belong to the same subject.

Classes and categories
• Classes do not have to be categories
• Classes
  – Form a taxonomy / partonomy
  – Strict, formal requirements
  – Membership based on features
• Categories
  – Can include other categories, intersect with them, etc. – a more set-like approach
  – A category can be a complex structure of classes, relationships and instances
  – A topic of interest that can span multiple, normally unrelated classes in the schema

Who? What? Where?
When? Why?
• WWW – What (who)? Where? When?
  – These text dimensions are orthogonal (in most texts).
  – Place and date/time are fairly easy to find.
  – What / who – the description of the article’s topic.
• Ontology classification
  – Focus on the text core – find ‘what’ and ‘who’ by matching entities.
  – Recognize relationships between entities to construct an initial document graph.
  – Overlaying the ontology graph on the core entities reveals the semantics of the analyzed text from background knowledge.
• Why? Hmm …

OntoCategorization system
[figure]

Algorithm sketch
• Convert text to a thematic graph
  – From words to entities (spotting).
  – Extract relationships and form triples (NLP).
  – Overlay background knowledge.
  – Remove unwanted entities (time/place).
• Categorize the graph using the ontology
  – Select the thematic component for categorization (disambiguation and topic set).
  – Find the best category coverage for the selected thematic graph.

Algorithm sketch – more details
• Match phrases in the text with entities in the ontology and assign an initial weight.
• Graph overlay – add relationships from the ontology between matched entities.
• Mark / remove entities related to dates and places.
• Add extracted relationships (NLP) between recognized entities.
• Propagate entity weights in the graph in a way similar to the hubs-authorities algorithm [Kleinberg].
• Find thematic graph(s) for further analysis – connected components.
• Calculate the most important entities based on weight and graph centrality.
• Find categories in the schema that cover the largest part of the thematic component, are lowest in the hierarchy and include the most important entities.
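The weight-propagation step in the list above can be illustrated with a small sketch. It is only an approximation of the idea: the slides say the propagation is "similar" to hubs-authorities [Kleinberg] without giving the exact update rule, so the function name `propagate_weights`, the choice to seed the scores with the initial match weights, the L2 normalization, and the example edges and weights are all assumptions made for illustration.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {n: v / norm for n, v in scores.items()}


def propagate_weights(edges, initial, iterations=20):
    """HITS-style propagation over a directed entity graph.

    edges:   iterable of (source, target) entity pairs
    initial: dict mapping entity -> initial match weight
    Returns (authority, hub) dicts.
    """
    edges = list(edges)
    nodes = set(initial)
    for s, t in edges:
        nodes.update((s, t))
    # Assumption: scores are seeded with the phrase-match weight;
    # entities without a match weight start at 0.
    auth = {n: initial.get(n, 0.0) for n in nodes}
    hub = dict(auth)
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for s, t in edges:
            new_auth[t] += hub[s]
        # Hub score of a node: sum of authority scores it points to.
        new_hub = {n: 0.0 for n in nodes}
        for s, t in edges:
            new_hub[s] += new_auth[t]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return auth, hub


# Hypothetical edges and weights loosely based on the Ford example.
edges = [
    ("Ford Motor Company", "Jaguar Cars"),
    ("Ford Motor Company", "Land Rover"),
    ("Alan Mulally", "Ford Motor Company"),
    ("Jaguar Cars", "Ford Motor Company"),
    ("Land Rover", "Ford Motor Company"),
]
initial = {name: 1.0 for edge in edges for name in edge}
auth, hub = propagate_weights(edges, initial)
top_authority = max(auth, key=auth.get)
print(top_authority)
```

In this toy graph the entity with the most incoming relationships ends up as the strongest authority, which matches the intuition on the slides that authority entities are the most representative of the topic.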
Experiments
• Wikipedia ontology
  – Includes around 2,000,000 entries
    • Multiple entity names (variations for matching)
  – Has a rich instance base (articles)
  – Internal hrefs, templates and “infobox” relations carry semantic connections among entries
  – Has a large schema with categories – over 310,00 categories
    • They DO NOT form a taxonomy, just a graph (they even include cycles)

Experiments (2)
• Wikipedia 2 RDF
  – Created initially by dbpedia.org [Auer, Lehmann]
  – Creation of RDF – some modifications
    • Focus on hrefs, infoboxes and templates
      – Special relationships for entities in infoboxes and templates
    • Only the English version of Wikipedia
    • Entity name variations for matching
      – Name, short name (no brackets), redirect, disambiguation, alternate names

Algorithm details (1)
• Entity name matching
  – Entities and relationships are the content of a document – they define its topic(s).
  – The ontology defines the known entities, the literals or phrases assigned to them, and their classifications.
  – The analyzed text must contain some of these entities to be categorizable – otherwise it is outside of the ontology scope.
  – Matching assigns spotted phrases to known literals, and later to entities.
    • Possible use of stop words and/or stemming.

Example of entity matching
Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.
[Figure: phrases in the sentence annotated with candidate entities]
  – Ford Motor Co. → Ford Motor Company
  – process → Process (computing), Business process, Process (science)
  – selling → Sales
  – Jaguar → Jaguar (animal), Jaguar Cars Ltd.
  – Land Rover → Land_Rover
  – Ford → Ford Motor Company
  – CEO → Chief Executive Officer
  – Alan Mulally → Alan_Mulally

Algorithm details (2)
• Semantic graph construction
  – Add relationships between recognized entities from the ontology, as the ontology defines meaningful (semantic) connections between them.
  – Add relationships extracted from NLP analysis of the annotated text.
  – Connected entities make graph analysis possible: connectivity, path finding, etc.
• Date and place elimination
  – Dates and places are orthogonal to the topic.
  – A path connecting entities through a place or a date carries little meaning for the document topic.

Example – parse tree and triples
Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.
[figure]

Example – NLP + ontology knowledge
Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.
[Figure: graph over the entities Jaguar (animal), Jaguar Cars, Land Rover, Ford Motor Company, Alan Mulally and Chief Executive Officer; edge labels: named_after, parent_company, is_a, has_CEO, CEO_of, sells]

Algorithm details (3)
• Weight propagation
  – Each entity has an initial weight assigned by the strength of phrase matching.
  – As on the web, entities are interconnected and influence each other.
  – We are looking for ‘authority’ entities – the assumption is that they are the most representative of the topic.

Algorithm details (4)
• Thematic subgraph in the matched graph
  – The assumption is that entities associated with the same or related topics are interconnected in the ontology – the same as in real life.
  – Graph component = topic-related entities.
  – Each document (or document fragment) should address one or two main topics – keep only the most important (by weight) and largest component(s).
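The thematic-subgraph step above can be sketched as a connected-component search followed by a ranking. This is a minimal illustration, not the system's implementation: the slides only say to keep the most important (by weight) and largest components, so ranking by summed entity weight with size as a tie-breaker, the function names, and the example edges and weights are assumptions.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in list(nodes) + list(adjacency):
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def thematic_components(weights, edges, keep=1):
    """Keep the `keep` highest-ranked components.

    Assumption: components are ranked by total entity weight,
    with component size as a tie-breaker.
    """
    components = connected_components(weights, edges)
    components.sort(
        key=lambda c: (sum(weights.get(n, 0.0) for n in c), len(c)),
        reverse=True,
    )
    return components[:keep]


# Hypothetical weights and edges loosely based on the Ford example.
weights = {
    "Ford Motor Company": 1.4,
    "Jaguar Cars": 1.0,
    "Land Rover": 1.0,
    "Alan Mulally": 0.9,
    "Jaguar (animal)": 0.3,
    "News": 0.4,
    "Newspaper": 0.2,
}
edges = [
    ("Ford Motor Company", "Jaguar Cars"),
    ("Ford Motor Company", "Land Rover"),
    ("Alan Mulally", "Ford Motor Company"),
    ("Jaguar Cars", "Jaguar (animal)"),
    ("News", "Newspaper"),
]
top = thematic_components(weights, edges, keep=1)
print(top)
```

Edges are treated as undirected here because the slide speaks of entities being interconnected; a directed notion of connectivity would need a different search.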
Thematic graph examples
[Figure: thematic graphs over the entities Chief Executive Officer, Jaguar Cars, Jaguar (animal), Ford Motor Company, Alan Mulally, Land Rover, Announcement, Sales, Business, Buyer, News, Newspaper]

Algorithm details (5)
• Most important and central entities
  – A topic tends to center around a few entities that are either the most important (by weight) or the most central in the graph.
  – The classification of the whole subgraph should also be a subset of the possible classifications of these entities.

Algorithm details (6)
• Categorization
  – A category is defined as a set and/or hierarchy of classes defined in the ontology schema.
  – Each entity has a hierarchy of assigned categories.
  – The best ontology class for a graph should:
    • Cover the maximum number of entities in the graph.
    • Be on a relatively low level in the hierarchy.
    • Be close in the hierarchy to the classified entity.
    • Include the most important entities (the more, the better).

Entities and categories
[Figure: category hierarchy above the example entities. Labels: Car Manufacturers, Felines, Living people, Ford, Off-road vehicles, Pantherinae, Ford people, Jaguar, Panthera, Ford executives, Jaguar Cars, Alan Mulally, Jaguar (animal), Ford Motor Company, Chief Executive Officer, Land Rover]

Longer example
Ford, utility ready to work on plug-in car
Automaker, Southern California Edison to unveil alliance in response to demand for energy-efficient vehicles.
DETROIT (Reuters) -- Ford Motor Co. and power utility Southern California Edison will announce an unusual alliance Monday aimed at clearing the way for a new generation of rechargeable electric cars, the companies said.
Ford (Charts , Fortune 500) Chief Executive Alan Mulally and Edison International (Charts , Fortune 500) Chief Executive John Bryson are scheduled to meet with reporters at Edison's headquarters in Rosemead, Calif., the companies said.
[...]
Led by Toyota Motor Corp's (Charts) Prius, the current generation of hybrid vehicles uses batteries to power the vehicle at low speeds and in to provide assistance during stop-and-go traffic and hard acceleration, delivering higher fuel economy.
General Motors Corp. (Charts , Fortune 500) has already begun work this year to develop its own plug-in hybrid car, designed to use little or no gasoline over short distances. The company showed off a concept version of the Chevrolet Volt in January at the Detroit Auto show and has awarded contracts to two battery makers to research advanced batteries for a possible production version.
Longer example graph properties
• Initial number of vertices: 205
• Initial number of edges: 361
• Largest component: 95
• Component for analysis: 35
• Central and most important entities:
  – Hybrid_vehicle: * Centrality 208, * weight 1.516873
  – Automobile: * Centrality 213, weight 1.249790
  – Internal_combustion_engine: * Centrality 233, weight 1.069511
  – Ford_Motor_Company: Centrality 237, * weight 1.451533
  – Southern_California_Edison: Centrality 351, * weight 1.308824

Longer example categories
• Category:Automobiles
  – CAT instances <13>, (avg. height 2.384615) weight [0.874697]
• Category:Alternative_propulsion
  – CAT instances <4>, (avg. height 1.250000) weight [0.873287]
• Category:Car_manufacturers
  – instances <3> (avg. height 1.000000) weight [0.781271]
• Category:Vehicles
  – CAT instances <13>, (avg. height 2.923077) weight [0.647903]
• Category:Transportation
  – CAT instances <11>, (avg. height 3.090909) weight [0.629714]

Wikipedia categories
• Wikipedia categories DO NOT form a taxonomy
  – They form just a directed graph that contains cycles.
  – It is not possible to use subsumption for categories.
  – Thesaurus-like structure. [Voss]
• Categories may be very deep and detailed, or very broad
  – It is hard to pinpoint a cut-off point that is good for categorization.
  – There is no simple mapping between news categories and categories in Wikipedia.

Overall performance of initial tests
• Tests against the classic BOW statistical classifier [McCallum].
• Source articles and categories taken from CNN – a total of 7158 documents in 14 categories.
  – Divided into a 50% training / 50% testing split
• Mapping between Wikipedia and CNN categories done manually by crawling the generated Wikipedia schema (still not really precise)

Text corpora – CNN news
[figure]

CNN and Wikipedia
• CNN categories
  – Classified by people
  – Describe mostly the article's interest, not necessarily its content
    • Frequently describe the reader’s interest rather than the true subject.
  – Hard to match to Wikipedia categories
• Wikipedia categories
  – Content-based
  – Very detailed and deep

Categorization results - BOW
[figure]

Categorization results – BOW on Wikipedia
[figure]

Categorization results - Wikipedia
[figure]

Summary of work
• Ontology storage and querying
  – BRAHMS RDF/S storage
  – SPARQLeR – query language extension with path queries
    • For use in the Glycomics project
• Prototype of ontology-based categorization
  – Partial implementation – not all modules included yet
  – Use of a general-purpose ontology – RDF graph created from the English Wikipedia
  – Initial tests confirm the proof of concept
  – Published as a technical report, submitted to WWW 2008

Remaining research
• Goal – Create a comprehensive model for ontology-based categorization.
• Create a semantic context definition
• Modify and/or create graph similarity measures that exploit context information

Current work in progress
• Goal – Create a system in which a user can categorize a text document with a given ontology using a specified semantic context.
• NLP module for relationship extraction
• Definition of query context
  – Extension of SPARQL with context queries

Proposed work
• Include NLP analysis in creating relationships between entities
  – This will help to link entities that have no connection in the ontology, or strengthen an existing connection.
• Explore categorization to a user-defined context (collection of instances, classes, structures, path expressions).
• Extend the definition of category to include context.
• Experiment with other well-developed ontologies to categorize more specialized documents
  – E.g., PubMed
• (optional) Study the applicability of the method to ontology-based document summarization.

Published papers
• Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery", Fourth International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005
• Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, 3-7 June 2007
• Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibanez, Budak Arpinar, Amit Sheth. "Peer-to-Peer Discovery of Semantic Associations", Second International Workshop on Peer-to-Peer Knowledge Management, San Diego, CA, July 17, 2005
• Maciej Janik, Krys Kochut. "Wikipedia in action: Ontological Knowledge in Text Categorization", UGA Technical Report No. UGA-CS-TR-07-001, November 2007 – submitted to WWW 2008
• S. Nimmagadda, A. Basu, M. Evenson, J. Han, M. Janik, R. Narra, K. Nimmagadda, A. Sharma, K.J. Kochut, J.A. Miller and W. S.
York, "GlycoVault: A Bioinformatics Infrastructure for Glycan Pathway Visualization, Analysis and Modeling," Proceedings of the 5th International Conference on Information Technology: New Generations (ITNG'08), Las Vegas, Nevada (April 2008) [to appear]

References
• Auer, S. and Lehmann, J. What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. In European Semantic Web Conference (ESWC'07), (Innsbruck, Austria, 2007), Springer, 503-517.
• Gamon, M. Graph-Based Text Representation for Novelty Detection. In Workshop on TextGraphs at HLT-NAACL 2006, (New York, NY, US, 2006).
• Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5 (2). 199-220, 1993.
• Kleinberg, J.M. Authoritative Sources in a Hyperlinked Environment. In ACM-SIAM Symposium on Discrete Algorithms, (1998).
• McCallum, A.K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
• Nagarajan, M., Sheth, A.P., Aguilera, M., Keeton, K., Merchant, A. and Uysal, M. Altering Document Term Vectors for Classification - Ontologies as Expectations of Cooccurrence. LSDIS Technical Report, November, 2006.
• Schenker, A., Bunke, H., Last, M. and Kandel, A. Graph-Theoretic Techniques for Web Content Mining. World Scientific, London, 2005.
• Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34 (1). 1-47.
• Voss, J. Collaborative thesaurus tagging the Wikipedia way. ArXiv Computer Science e-prints, cs/0604036.