Semantic Lifting for Traditional Content Resources
Semantic CMS Community

Lecturer
Organization
Date of presentation

Co-funded by the European Union
www.iks-project.eu
Copyright IKS Consortium

Part I: Foundations
  (1) Introduction of Content Management
  (2) Foundations of Semantic Web Technologies
Part II: Semantic Content Management
  (3) Knowledge Interaction and Presentation
  (4) Knowledge Representation and Reasoning
  (5) Semantic Lifting
  (6) Storing and Accessing Semantic Data
Part III: Methodologies
  (7) Requirements Engineering for Semantic CMS
  (8) Designing Semantic CMS
  (9) Semantifying your CMS
  (10) Designing Interactive Ubiquitous IS

What is this Lecture about?

We have learned ...
- ... how to build ontologies representing complex knowledge domains.
- ... a way to reason about knowledge.
We need a way ...
- ... to extract knowledge from content in an automatic way: Semantic Lifting

Overview

- What is semantic lifting?
- Core concepts
- Scenarios
- Requirements
- Technologies
- Semantic Reengineering
- Semantic Enhancements of textual content

What is “Semantic Lifting”?
- Semantic Lifting refers to the process of associating content items with suitable semantic objects as metadata, turning “unstructured” content items into semantic knowledge resources.
- Semantic Lifting makes “hidden” metadata in content items explicit.

Semantic Lifting Targets

Semantic Reengineering of structured data
- Semantic Lifting harmonizes metadata representations.
- Semantic Lifting reengineers data from an existing resource so that the data can be reused within a semantic repository.

Content Enhancement
- Semantic Lifting generates additional metadata and annotations by semantic analysis of content items.
- Semantic Lifting classifies content objects by means of semantic annotations.

Structured Content

- Structured content provides implicit semantics through the structure definition
  - Table definitions in relational databases, XML schemata, field definitions for address books, calendars, etc.
- Application programs are designed to “know” how to interpret the structures and the data within them.
- Semantic Lifting is used for reengineering to support data exchange and seamless interoperability between different systems.

Unstructured Content

- Unstructured content: images, texts, videos, music, web pages composed of various types of media items
- Meaningful only to humans, not to machines
- Content must be described semantically by metadata to become meaningful to machines, e.g. what the text or image is about.
- Semantic Lifting is used as content enhancement.

Mixed Content

- There is no strict dichotomy between structured and unstructured content
- Structured databases are used to store unstructured content types, such as texts, images etc.
- Documents can be composed of unstructured content items such as free text and images as well as more structured information, e.g.
  tables and charts
[Figure labels: Free text, Structured content]

Metadata: Variants

Metadata exist in many forms:
- Free text descriptions
- Descriptive content-related keywords or tags, from fixed vocabularies or in free form
- Taxonomic and classificatory labels
- Media-specific metadata, such as MIME types, encoding, language, bit rate
- Media-type-specific structured metadata schemes such as EXIF for photos, IPTC tags for images, ID3 tags for MP3, MPEG-7 for videos, etc.
- Content-related structured knowledge markup, e.g. to specify what objects are shown in an image or mentioned in a text, what the actors are doing, etc.

Metadata: Variants

- Inline metadata are part of the content, e.g. ID3 tags embedded in MP3 files
- Offline metadata are kept separate from the content

Formal semantic metadata

Data representation in a formalism with a formal semantic interpretation that defines the concept of (logical) entailment for reasoning:
- Soundness: every conclusion that can be deduced is a valid entailment
- Completeness: every valid entailment can be deduced
- Decidability: a procedure exists to determine whether a conclusion can be deduced
Embodiments:
- Logics
- Knowledge Representation Systems, Description Logics
- Semantic Web: RDF, OWL

“Semantics” in CMS

- CMS provide various methods to include metadata
  - Organize content in hierarchies
  - Hierarchical taxonomies
  - Attachment of properties to content items for metadata
  - Content type definitions with inheritance
- These methods are used in CMS in an ad-hoc fashion without clear semantics. Therefore no well-defined reasoning is possible.

Semantic Lifting Usage

Content Creation and Acquisition
- Authoring content
- Uploading external content/documents
  - automatic extraction and analysis, e.g.
    for indexing
- Importing content from external sources/documents
- Support content editors in providing metadata of specified types
Integration of external content into content repository
- Content needs to be transformed to match internal CMS structures and metadata schemes
Crossreferencing/linking among CMS content items and external content
- Detect related or additional content
- Add pointers/links to related or additional content

Semantic Lifting Usage

Access to external documents and content repositories
- Semantic harmonization with CMS semantic structures
Semantic interoperability in data exchange with other content repositories
- The CMS needs to understand the data structures used by external services and programs
  - E.g. synchronization of a local calendar from Outlook with an external calendar based on the iCalendar format
  - E.g. importing RDF from a Linked Data endpoint such as DBpedia
- The CMS must present its data in a form understood by external target services or programs

Semantic Lifting Usage

Publishing content with metadata
- Metadata need to be transformed into a form compatible with the publication format
  - E.g. converting FreeDB metadata into ID3 tags for inclusion in an MP3 file

Publishing Web Content with semantic metadata

- Augmenting web content with structured information is becoming increasingly important
- Several methods have emerged in recent years to include structured metadata in Web pages
  - Microformats
  - RDFa
  - Microdata (HTML5)
- Supported by the major search engines to improve search and result presentation, e.g. Google (“Rich Snippets”), Bing, Yahoo

Augmenting Web Content

- The HTML code contains a review of a restaurant in plain text, using only line breaks for structuring
- Without specialized information extraction and analysis tools it cannot be interpreted, e.g.
  that it is a review (of what and when?), who the reviewer was, etc.

    <div>
      L’Amourita Pizza
      Reviewed by Ulysses Grant on Jan 6.
      Delicious, tasty pizza on Eastlake!
      L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.
      Rating: 4.5
    </div>

Microformats

Same text, but with additional span elements whose class attributes encode the type of the contained information (hReview) and the properties of that type.

    <div class="hreview">
      <span class="item">
        <span class="fn">L’Amourita Pizza</span>
      </span>
      Reviewed by <span class="reviewer">Ulysses Grant</span> on
      <span class="dtreviewed">
        Jan 6<span class="value-title" title="2009-01-06"></span>
      </span>.
      <span class="summary">Delicious, tasty pizza on Eastlake!</span>
      <span class="description">L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.</span>
      Rating: <span class="rating">4.5</span>
    </div>

RDFa

Same text, but with additional attributes and span elements encoding an RDF structure:
- a namespace declaration of the ontology used
- the RDF class encoded by a typeof attribute and its properties by property attributes

    <div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
      <span property="v:itemreviewed">L’Amourita Pizza</span>
      Reviewed by <span property="v:reviewer">Ulysses Grant</span> on
      <span property="v:dtreviewed" content="2009-01-06">Jan 6</span>.
      <span property="v:summary">Delicious, tasty pizza on Eastlake!</span>
      <span property="v:description">L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss.
      An ideal neighborhood pizza joint.</span>
      Rating: <span property="v:rating">4.5</span>
    </div>

Microdata (HTML5)

Same text, but with additional attributes and span elements: a class declaration as the value of an itemtype attribute and its properties as values of itemprop attributes.

    <div>
      <div itemscope itemtype="http://data-vocabulary.org/Review">
        <span itemprop="itemreviewed">L’Amourita Pizza</span>
        Reviewed by <span itemprop="reviewer">Ulysses Grant</span> on
        <time itemprop="dtreviewed" datetime="2009-01-06">Jan 6</time>.
        <span itemprop="summary">Delicious, tasty pizza in Eastlake!</span>
        <span itemprop="description">L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.</span>
        Rating: <span itemprop="rating">4.5</span>
      </div>
    </div>

Lifting Requirements: Overview

Top-level requirements
- Semantic Associations with Content
- Semantic Harmonization
- Semantic Linking
- Interactive Lifting
- Customizability
- Semantically Transparent Structured Content Sources

Semantic Associations with Content

Unstructured content and information must be supplied with structured semantic annotations and metadata.
- Support for various content/media types
- Information extraction from text, topic classification, image tagging, …
- Support for creation of semantic annotations in content authoring

Semantic Harmonization

Metadata and annotations must be harmonized with requirements for semantic processing in the CMS
- Reengineering methods, interpreters and wrappers for all types and formats of metadata and annotations, e.g.
  tags, microformats, XML metadata (MPEG-7, …), ID3 tags, EXIF data, …
- Ensure semantic interoperability of data and annotation schemes within the CMS and across external resources
- Ontology mapping and harmonization of annotations
  - External metadata
  - Metadata generated by semantic analysis

Semantic Linking

Lifting must enable the interlinking of content objects by semantic relationships.
- Internal linking of content items within the CMS
- Links to external resources, e.g. Linked Open Data
- Establish semantic relatedness of content for different views as well as different search, navigation and browsing strategies, …
  - Direct semantic links among content items and metadata
  - Similarity relations over sets of content items
  - Clustering of content items

Interactive Lifting

Lifting must interact with CMS users.
- Suggest semantic annotations during content creation
- Support for various publishing formats such as microformats, RDFa, etc.
- Automatic annotations (autotagging) with optional manual correction
- Learning capabilities and adaptability of automatic annotation components from user feedback

Customizability

Lifting components must be customizable by CMS users/customers.
- Users must not be restricted to predefined vocabularies, ontologies, …
- Domain ontologies, terminologies, tag sets are defined by CMS users/customers.
- Browsers and editors for component resources are necessary.

Transparent Structured Content Sources

Structured content sources need to be reengineered to semantic resources
- Support uniform data access to structured content repositories, e.g.
  SPARQL endpoints based on D2RQ technologies for transparent access to RDF and non-RDF databases
- Extraction of ontologies from database structures, schemata, XML, resources, …
- Alignment and mapping of the descriptions

Semantic Reengineering of structured data sources

- Focus on tools for reengineering structured data sources to RDF representations
- Many tools and platforms:
  - D2R Server: exhibits relational DBs as RDF
  - Talis platform: Linked Open Data
  - Triplify: like D2R but in PHP
  - Virtuoso middleware
  - Krextor/OntoCape: generating RDF from XML
  - Various transformers for inducing RDF ontologies and instance data from XSD and XML
- More details in the presentation on Knowledge Representation (KReS)

Semantic Content Enhancements: Overview

Focus here is on textual content
- Metadata extraction from existing content in various formats to make embedded metadata explicit
- Information extraction from textual content:
  - Named entities
  - Coreference
  - Relationships
- Classification and clustering of content items
  - Statistical methods and tools
  - Semantic classification based on ontological definitions

Information Extraction

Rule-based approaches for shallow text analysis
- Usually based on finite-state technology: fast, robust
- Cascaded processing
- Based on templates as target structures to be filled
- Example platforms: GATE, SProUT
- Can be used for nearly any kind of extraction/annotation task, including Named Entity Recognition (NER)
- Easy customization

Information Extraction

Semi-supervised learning approaches
- Rule induction from corpora
- Use example annotations as seeds for bootstrapping
- Pattern rules learned from contextual features with generalization over contexts

Named Entities

Statistical approaches: examples
- LingPipe: Hidden Markov Models
- OpenNLP: Maximum Entropy Models
- Stanford NER: Conditional Random Fields
Statistical models created by supervised learning techniques
- Large annotated corpora required
- Customization difficult except by re-annotation/re-training
- Not suitable for every type of named entity

[Figure: NER Document Markup]

[Figure: NER Markup for a Web Page]

IE Template

A Person template (as a Typed Feature Structure) instantiated from text. The template supports the extraction of various properties of a person.

Classification

Assign a data item to some predefined class
- Statistical classification
- Numerous methods, e.g.:
  - Bayes classifiers
  - K-Nearest Neighbor (KNN)
  - Support Vector Machines (SVM)

Semantic Classification

Semantic classification in Knowledge Representation Formalisms
- Infer the item's class from the item's properties by matching them with the class definitions: which classes allow for these properties?
- Assume that our ontology contains two classes with some properties:
    SpatialThing:    latitude, longitude
    PopulatedPlace:  population
- Paderborn is an object with latitude “51°43′0″N”, longitude “8°46′0″E” and a population of 146283.
- Then we can infer that Paderborn is a SpatialThing, since in our ontology those are the things that have latitudes and longitudes.
- We can also infer that it is a PopulatedPlace, since those are the things that have a population.
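The inference in the Paderborn example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the dictionary PROPERTY_DOMAINS and the function infer_classes are hypothetical names introduced here, and the lookup merely mimics the effect of declaring which class a property belongs to. It does not use a reasoner or any RDF/OWL library.

```python
# Hypothetical mini-ontology: each property is declared for exactly one class.
# The class and property names follow the Paderborn example; everything else is assumed.
PROPERTY_DOMAINS = {
    "latitude": "SpatialThing",
    "longitude": "SpatialThing",
    "population": "PopulatedPlace",
}


def infer_classes(item):
    """Return the set of classes implied by the properties the item carries."""
    return {PROPERTY_DOMAINS[prop] for prop in item if prop in PROPERTY_DOMAINS}


paderborn = {
    "latitude": "51°43′0″N",
    "longitude": "8°46′0″E",
    "population": 146283,
}

# Paderborn has latitude/longitude and a population, so both classes are inferred.
print(sorted(infer_classes(paderborn)))
```

An item that only carries a latitude would be classified as a SpatialThing but not as a PopulatedPlace. A real knowledge representation system would derive the same conclusions from the class definitions by logical entailment rather than by a dictionary lookup.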
Clustering

Detection of classes in a data set
- Partitioning data into classes in an unsupervised way with
  - high intra-class similarity
  - low inter-class similarity
- Main variants:
  - Hierarchical clustering: Agglomerative
  - Partitioning clustering: K-Means

Tools for Classification and Clustering

- Generic: WEKA, a Java library implementing several dozen methods for data mining. Application to textual data requires special preprocessing.
- Text: MALLET, a Java library with implementations of major methods for text and document classification and clustering

Evaluation Measures

Standard evaluation measures for IE/IR etc. systems, with tp = true positive, tn = true negative, fp = false positive, fn = false negative:

  Accuracy:  $\mathit{acc} = \frac{tp + tn}{tp + fp + tn + fn}$
  Precision: $\mathit{prec} = \frac{tp}{tp + fp}$
  Recall:    $\mathit{recall} = \frac{tp}{tp + fn}$
  F-Measure: $F = \frac{2 \cdot \mathit{prec} \cdot \mathit{recall}}{\mathit{prec} + \mathit{recall}}$

Evaluation Measures: Classification

A confusion matrix that reports on the classification of 27 wines by grape variety. The reference in this case is the true variety, and the response arises from the blind evaluation of a human judge.

Many-way Confusion Matrix (rows: reference, columns: response)

  Reference          Cabernet  Syrah  Pinot   Precision  Recall  F-Measure
  Cabernet               9       3      0       0,69      0,75     0,72
  Syrah                  3       5      1       0,56      0,56     0,56
  Pinot                  1       1      4       0,80      0,67     0,73
  Macro average                                 0,68      0,66     0,67
  Overall accuracy                              0,67

  Precision for Cabernet = 9/(9+3+1)
  Recall for Pinot = 4/(1+1+4)

Evaluation Measures: NER

Reference annotations:
  [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
Recognized annotations:
  [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
-> Microsoft Corp.
CEO Steve Ballmer announced the release of Windows 7 today

        Counts  Entities
  TP    1       [Microsoft Corp.]
  FP    3       [CEO] [Steve] [today]
  FN    2       [Windows 7] [Steve Ballmer]
  TN

  Precision: 1/(1+3) = 0,25
  Recall:    1/(1+2) = 0,33
  F-Measure: 2*0,25*0,33/(0,25+0,33) = 0,28

NER Evaluation

- Nobel Prize Corpus from NYT, BBC, CNN
- 538 documents (average 735 words/document)
- 28948 person, 16948 organization occurrences

               SProUT   Calais   Stanford NER   OpenNLP
  Precision    77,26    94,22    73,21          57,69
  Recall       65,85    86,66    73,62          42,86
  F1           71,10    90,28    73,41          49,18

References

- Microformats: http://microformats.org/
- RDFa: http://www.w3.org/TR/xhtml-rdfa-primer/
- Google Rich Snippets: http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html
- Linked Data: http://linkeddata.org/guides-and-tutorials
- Linked Data: Heath and Bizer, Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011. (Online: http://linkeddatabook.com/book)
- Information Extraction: Moens, Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer, 2006.
- Text Mining: Feldman and Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.
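As a supplement to the Evaluation Measures slides, the following minimal sketch recomputes the figures from the NER example (TP = 1, FP = 3, FN = 2) and from the wine confusion matrix. The function names are hypothetical and chosen for illustration only, and returning 0.0 for an undefined ratio is an assumption of this sketch, not something the slides specify. Note that the exact F-measure for the NER example is 2/7 ≈ 0.286; the slide's 0,28 results from inserting the already rounded precision and recall values.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) computed from raw counts.

    Assumption of this sketch: an undefined ratio (zero denominator) is reported as 0.0.
    """
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f


def per_class_scores(matrix):
    """Per-class scores for a square confusion matrix (rows = reference, columns = response)."""
    scores = []
    for i, row in enumerate(matrix):
        tp = row[i]
        fp = sum(r[i] for r in matrix) - tp  # column i without the diagonal cell
        fn = sum(row) - tp                   # row i without the diagonal cell
        scores.append(precision_recall_f(tp, fp, fn))
    return scores


def accuracy(matrix):
    """Share of items on the diagonal, i.e. items whose response equals the reference."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return correct / total


# NER example: one correct entity, three spurious ones, two missed ones.
ner_scores = precision_recall_f(tp=1, fp=3, fn=2)

# Wine example: Cabernet, Syrah, Pinot (27 wines in total).
wine_labels = ["Cabernet", "Syrah", "Pinot"]
wine_matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
wine_scores = per_class_scores(wine_matrix)

print("NER:", [round(x, 2) for x in ner_scores])
for label, scores in zip(wine_labels, wine_scores):
    print(label, [round(x, 2) for x in scores])
print("Accuracy:", round(accuracy(wine_matrix), 2))
```

Reading false positives from a class's column and false negatives from its row matches the two worked cells in the confusion matrix slide: precision for Cabernet divides by the Cabernet column sum, recall for Pinot by the Pinot row sum.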