1162013 Lecture notes for INFO116 Lecture 3: Adding Semantics to data Dr. Csaba Veres, Institute for Information and Media Science, The University of Bergen In order to make any data (temperature, location, etc.), Web resource (image, audio file, video stream, etc.), or real-world entity (painting, IP camera, can of milk, airline reservation, etc.), collectively abbreviated as objects, machine understandable, we need to describe their characteristics/ associated information in a machine processable manner. This requires determining what aspects of the objects to abstract and describe, the semantic modeling problem, and how to materialize these descriptions in a machine processable form that can elicit semantic comprehension and facilitate interoperability, the representation and reasoning problem. Different forms of data and their semantics • Unstructured data ⁃ Grammatical text 1 1162013 Grammatical text is exemplified by articles appearing in newspapers (such as those in The Wall Street Journal, or The New York Times), research journals (such as those in PubMed), magazines, and books. The text is characterized by the use of full sentences satisfying grammatical constraints. Automatic abstraction and assimilation of content can benefit from linguistic (syntactic and semantic) structure as well as domain-semantics provided by background knowledge (e.g., in bioinformatics/medical, economics, and technology). There is no technology that can “understand” text at a level of a human reader, so the semantics which can be extracted from unstructured data involves named entity extraction, extraction of relationships, and “understanding” a restricted set of inputs (e.g. Siri). Information extraction from grammatical text often makes use of syntactic patterns and contexts to recognize entity types (such as person, place, currency, and date) and relationships. For example, from a passage like: (1) The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for GeorgiaPacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia- Pacific in Atlanta. Extract a table like: 2 1162013 (From NLTK toolkit documentation http://nltk.org/book/ch07.html) Information Extraction Architecture A lot of academic work on text analysis has now culminated in the development of UIMA (Unstructured Information Management Architecture) that provides a common framework for content analytics (specifically, processing information to extract meaning and create structured data). 3 1162013 Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes. Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS , a standards organization). http://uima.apache.org ⁃ Application specific user generated content User-generated content is exemplified by citizen microblogs (such as from Twitter) and exchanges on social media (such as on Facebook). (Even though blogs fall under UGC, their structure is closer to grammatical text than the 4 1162013 text found on Twitter and Facebook, so they are excluded.) The UGC text is characterized by being terse, colloquial, context-poor, and containing words/ phrases with nonstandard/creative, homophone-based spellings. Automatic assimilation of content requires background knowledge (in the form of ontologies) to deal with implicit spatio-temporal-thematic context, and novel techniques and tools to summarize and visualize aggregated content to deal with its large size and real-time nature. Robust, scalable, and high-quality analysis of user-generated content is an active area of contemporary research. E.g. Sentiment analysis http://www.sentiment140.com http://sentiwordnet.isti.cnr.it • Semi structured data Semi-structured data are normally text data that contain tags or other markers to delimit and capture hierarchical and aggregation structure associated with semantic fields within the data. HTML documents1 and XML documents with domain-specific tags (such as DBLP XML records) are classic examples of semi-structured data. Typically, an XML document consists of tagged text (that is, markup inter-twined with text) in which the tag makes explicit the category and the properties of the enclosed text using attribute-value pairs. Even though the 5 1162013 tags in semi-structured data are syntactic in nature, they are conducive to incorporating semantics in that the tags can be chosen from a standard semantic model or potentially mapped to one. For example, the tags can be display- oriented providing information such as color and font, or semanticsoriented providing information such as type and normal-form. Semantic tags permit abstraction of meaning in a standard form. In summary, semistructured documents can contain descriptive text that is human comprehensible and semantic tags that are machine processable. • Structured data Structured data have well-defined formal syntax and an associated data model. Relational databases, machine sensor data stream (such as from sensors on a weather station), RDF triples, and XML serialization of a Webservice call (for data interchange) all illustrate the variety in structured data. By definition, structured data can be parsed with relative ease. To provide semantics, we need to associate real-world objects/entities and their relationships with the structured data on the basis of their schema. To comprehend machine sensor data, it is also necessary to incorporate spatiotemporal-thematic context using information about time of measurement, location, and type of sensor, physical quantity measured, unit of measure, etc. • Multimedia data We lump all other non-textual content together as multimedia data, which includes photos, videos, and audio clips found on social-media sharing services such as Flickr and YouTube. Due to the difficulties involved in analyzing multimedia content, the only viable approach to gleaning semantics is through the analysis of associated textual descriptors (metadata) and community comments. Techniques for adding semantics to data 6 1162013 • MICROFORMATS A microformat is an approach to semantic markup that seeks to reuse existing HTML/XHTML tags to incorporate metadata. Specifically, it embeds and encodes semantics within the attributes of markup tags. This approach allows contact information, geographic coordinates, and calendar events, etc., to be both human readable and machine processable (using hCard, geo, and hCalendar microformats respectively) as illustrated below. The text:“Dayton, OH is located at 39.59, -84.22.” can be annotated using geo-microformat as: Dayton , OH is located at <span class ="geo"> <span class =" latitude ">39.59</span >, <span class="longitude">−84.22</span> </ span > The HTML fragment <div > <div >Amit Sheth </ div > <div>Kno.e.sis Center</div> <div>937−775−5217</div> <a href="http :// knoesis .org/">http :// knoesis .org/</a> </ div > can be semantically enhanced using the hCard microformat markup as <div class=card"> <div class="fn">Amit Sheth</div> <div class="org">Kno.e.sis Center</div> <div class =" tel ">937−775−5217</div > <a class="url" href="http://knoesis.org/">http://knoesis.org/</a> </ div > The class attribute values fn, org, tel, and url stand for formatted name, organization, telephone number, and Uniform Resource Locator respectively. hCard is an HTML version of the vCard standard. Microformats is a commuity driven effort to extend semantics to well specified, precise domains, e.g. ⁃ hCalendar - events 7 1162013 ⁃ hCard - people, organizations, contacts ⁃ rel-license - licensed content ⁃ rel-nofollow - links in untrusted 3rd party content ⁃ rel-tag - tag posts and pages by subject ⁃ XFN - social relationships and rel-me links among profiles for the same person ⁃ XMDP - define a microformat vocabulary / profile ⁃ XOXO - outlines The microformats principles ⁃ Solve a specific problem ⁃ Start as simple as possible ⁃ Design for humans first, machines second ⁃ Reuse building blocks from widely adopted standards ⁃ Modularity / embeddability ⁃ Enable and encourage decentralized development, content, services • RDFA RDFa (Resource Description Framework - in - attributes) adds a set of attribute-level extensions to XHTML for embedding rich metadata within Web documents. RDFa is a W3C-proposed standard and a markup language that enables the layering of RDF information on any XHTML or XML document. RDFa provides a set of attributes that can represent semantic metadata within an XML language from which RDF triples can be extracted using simple mapping. Use of RDFa improves traceability and minimizes duplication of information in comparison with an approach that maintains a separate translation/abstraction of a document into RDF. 8 1162013 The core subset of RDFa attributes include • about—a URI extracted as the subject of an RDF triple that specifies the resource the metadata is about; • rel and rev—extracted as the object property (predicate) of an RDF triple, this URI specifies a relationship or reverse-relationship with another resource; • href, src, and resource—extracted as the object of an RDF triple, this URI specifies the partner resource; • property—extracted as the datatype property (predicate) of an RDF triple, this URI specifies a property for the content of an element; and instanceof—extracted as the object property “rdf:type” coupled with an RDF triple’s object, this optional attribute specifies the RDF type of the subject (the resource that the metadata is about). One of the biggest users of RDFa is Facebook, with Open Graph Protocol 9 1162013 (OGP). The Open Graph protocol enables any web page to become a rich object in a social graph by specifying metadata terms such as title, type, image, and URL <div prefix="og: http://ogp.me/ns# dc: http://purl.org/dc/elements/1.1/" about="/photos"> <h2 property="og: title">Hubble Space Telescope ’s Top Images</h2> Sombrero Galaxy (M104) is an edge−on spiral galaxy , 50,000 light−years across and 28 million light years from Earth. <div about="http://example.com/photos/sombrero.jpg"> <img src="http://example.com/photos/sombrero.jpg" /> <span property="dc:title">The Majestic Sombrero Galaxy (M104) </span > <span property ="dc:creator ">Hubble Space Telescope </span > </div > </div > The above RDFa example shows use of Open Graph Protocol and Dublin Core vocabularies (including an association of a title and a creator to an image). The Dublin Core specifies metadata terms to describe Web and physical resources such as title, creator, publisher, and date. The resulting extracted RDF triples are shown below. < h t t p : / / e x a m p l e . com / p h o t o s > <http://ogp.me/ns#title > "Hubble Space Telescope ’s Top Images" < h t t p : / / e x a m p l e . com / p h o t o s / s o m b r e r o . j p g > <http://purl.org/dc/elements/1.1/title >"The Majestic Sombrero Galaxy (M104)" < h t t p : / / e x a m p l e . com / p h o t o s / s o m b r e r o . j p g > <http:// purl.org/dc/elements/1.1/creator> "Hubble Space Telescope" http://www.w3.org/2012/pyRdfa/#distill_by_input : 10 1162013 @prefix dc11: <http://purl.org/dc/elements/1.1/> . @prefix og: <http://ogp.me/ns#> . </photos> og:title "Hubble Space Telescope ’s Top Images" . <http://example.com/photos/sombrero.jpg> dc11:creator """Hubble Space Telescope """; dc11:title "The Majestic Sombrero Galaxy (M104) " . RDFa can be quite complicated because it is relatively expressive. As a result, a competing proposal Microdata has emerged. • MICRODATA Microdata is a proposed HTML5 specification used to embed semantics within Web documents. It is an attempt to provide a simpler alternative to approaches such as RDFa and Microformats for annotating HTML elements, to enable applications such as search engines and Web crawlers to better assimilate web page content. It is an essential feature of the schema.org effort. <section itemscope itemtype="http://example.org/animals#cat"> <h1 itemprop="name http :// example .com/fn">Hedral </h1> <p itemprop="desc">Hedral is a male american domestic shorthair , with a fluffy <span itemprop="http://example.com/color">black</span> fur with <span itemprop="http://example.com/color">white</span> paws and belly. </p> <img itemprop="img" src="hedral.jpeg" alt="" title="Hedral, age 18 months"> </section > schema.org markup with Microdata 11 1162013 Let's start with a concrete example. Imagine you have a page about the movie Avatar—a page with a link to a movie trailer, information about the director, and so on. Your HTML code might look something like this: <div> <h1>Avatar</h1> <span>Director: James Cameron (born August 16, 1954)</span> <span>Science fiction</span> <a href="../movies/avatar-theatrical-trailer.html">Trailer</a> </div> To begin, identify the section of the page that is "about" the movie Avatar. To do this, add the itemscope element to the HTML tag that encloses information about the item, like this: <div itemscope> <h1>Avatar</h1> <span>Director: James Cameron (born August 16, 1954) </span> <span>Science fiction</span> <a href="../movies/avatar-theatrical-trailer.html">Trailer</a> </Div> By adding itemscope, you are specifying that the HTML contained in the <div>...</div> block is about a particular item. But it's not all that helpful to specify that there is an item being discussed without specifying what kind of an item it is. You can specify the type of item using the itemtype attribute immediately after the itemscope. <div itemscope itemtype="http://schema.org/Movie"> <h1>Avatar</h1> <span>Director: James Cameron (born August 16, 1954)</span> <span>Science fiction</span> <a href="../movies/avatar-theatrical-trailer.html">Trailer</a> </div> 12 1162013 This specifies that the item contained in the div is in fact a Movie, as defined in the schema.org type hierarchy. Item types are provided as URLs, in this case http://schema.org/Movie. What additional information can we give search engines about the movie Avatar? Movies have interesting properties such as actors, director, ratings. To label properties of an item, use the itemprop attribute. For example, to identify the director of a movie, add itemprop="director" to the element enclosing the director's name. (There's a full list of all the properties you can associate with a movie at http://schema.org/Movie.) <div itemscope itemtype ="http://schema.org/Movie"> <h1 itemprop="name">Avatar</h1> <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span> <span itemprop="genre">Science fiction</span> <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a> </div> Note that we have added additional <span>...</span> tags to attach the itemprop attributes to the appropriate text on the page. <span> tags don't change the way pages are rendered by a web browser, so they are a convenient HTML element to use with itemprop. Search engines can now understand not just that http:// www.avatarmovie.com is a URL, but also that it's the URL for the trailer for the science-fiction movie Avatar, which was directed by James Cameron. Sometimes the value of an item property can itself be another item with its own set of properties. For example, we can specify that the director of the movie is an item of type Person and the Person has the properties name and birthDate. To specify that the value of a property is another item, you begin a new itemscope immediately after the corresponding itemprop. 13 1162013 <div itemscope itemtype ="http://schema.org/Movie"> <h1 itemprop="name"&g;Avatar</h1> <div itemprop="director" itemscope itemtype="http://schema.org/ Person"> Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954)</span> </div> <span itemprop="genre">Science fiction</span> <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a> </div> Microdata has had a difficult lifecycle in some circles, as it was possibly ill conceived as a competitor to the already established RDFa standard. http://manu.sporny.org/2013/microdata-downward-spiral/ : “A number of observers have been surprised by these events, but for those that have been involved in the month-to-month conversation around Microdata, it makes complete sense. Microdata doesn’t have an active community supporting it. It never really did. For a Web specification to be successful, it needs an active community around it that is willing to do the hard work of building and maintaining the technology. RDFa has that in spades, Microdata does not. Microdata was, primarily, a shot across the bow at RDFa. The warning worked because the RDFa community reacted by creating RDFa Lite, which matches Microdata feature-for-feature, while also supporting things that Microdata is incapable of doing. The existence of RDFa Lite left the HTML Working Group in an awkward position. Publishing two specifications that did the exact same thing in almost the exact same way is a position that no 14 1162013 standards organization wants to be in. At that point, it became a race to see which community could create the developer tools and support web developers that were marking up pages. Microdata, to this day, still doesn’t have a specification editor, an active community, a solid test suite, or any of the other things that are necessary to become a world class technology. To be clear, I’m not saying Microdata is dying (4 million out of 329 million domains use it), just that not having these basic things in place will be very problematic for the future of Microdata.” Manu has always maintained that the Microdata effort, together with schema.org is an effort to control web vocabulary, especially since Facebook already uses RDFa. http://manu.sporny.org/2011/false-choice/ • RDFA LITE http://www.w3.org/TR/rdfa-lite/ RDFa Lite is a minimal subset of RDFa, the Resource Description Framework in attributes, consisting of a few attributes that may be used to express machine-readable data in Web documents like HTML, SVG, and XML. While it is not a complete solution for advanced data markup tasks, it does work for most day-to-day needs and can be learned by most Web authors in a day. The full RDFa syntax [RDFA-CORE] provides a number of basic and advanced features that enable authors to express fairly complex structured data, such as relationships among people, places, and events in an HTML or XML document. Some of these advanced features may make it difficult for authors, who may not be experts in structured data, to use RDFa. This lighter version of RDFa is a gentler introduction to the world of structured data, intended for authors that want to express fairly simple data in their web 15 1162013 pages. The goal is to provide a minimal subset that is easy to learn and will work for 80% of authors doing simple data markup. RDFa Lite consists of five simple attributes; vocab, typeof, property, resource, and prefix. RDFa 1.1 Lite is completely upwards compatible with the full set of RDFa 1.1 attributes. This means that if an author finds that RDFa Lite isn't powerful enough, transitioning to the full version of RDFa is just a matter of adding the more powerful RDFa attributes into the existing RDFa Lite markup. So, if you wanted to talk about People, the vocabulary that you would use would specify terms like name and telephone number. When we want to mark up things on the Web, we need to do something very similar, which is specify which vocabulary that we are going to be using. Here is a simple example that specifies a vocabulary that we intend to use to markup things in the paragraph: <p vocab="http://schema.org/"> My name is Manu Sporny and you can give me a ring via 1-800-555-0199. </P> In this example we have specified that we are going to be using the vocabulary that can be found at http://schema.org/. Once we have specified the vocabulary, we need to specify the type of the thing that we're talking about. In this particular case we are talking about a Person, which can be marked up like so: <p vocab="http://schema.org/" typeof="Person"> My name is Manu Sporny and you can give me a ring via 1-800-555-0199. </P> Now all we need to do is specify which properties of that person we want 16 1162013 to point out to the search engine. In the following example, we mark up the person's name, phone number and web page. Both text and URLs can be marked up with RDFa Lite. In the following example, pay particular attention to the types of data that are being pointed out to the search engine: <p vocab="http://schema.org/" typeof="Person"> My name is <span property="name">Manu Sporny</span> and you can give me a ring via <span property="telephone">1-800-555-0199</span> or visit <a property="url" href="http://manu.sporny.org/">my homepage</a>. </P> Now, when somebody types in “phone number for Manu Sporny” into a search engine, the search engine can more reliably answer the question directly, or point the person searching to a more relevant Web page. Thus RDFa Lite has the same expressivity as Microdata, but has additional desirable properties such as the ability to talk about resources and define prefixes. If you want Web authors to be able to talk about each thing on your page, you need to create an identifier for each of these things. Just like we create identifiers for parts of a page using the id attribute in HTML, you can create identifiers for things described on a page using the resource attribute: <p vocab="http://schema.org/" resource="#manu" typeof="Person"> My name is <span property="name">Manu Sporny</span> and you can give me a ring via <span property="telephone">1-800-555-0199</span>. <img property="image" src="http://manu.sporny.org/images/manu.png" / > 17 1162013 </P> If we assume that the markup above can be found at http://example.org/ people, then the identifier for the thing is the address, plus the value in the resource attribute. Therefore, the identifier for the thing on the page would be: http://example.org/people#manu. This identifier is also useful if you want to talk about that same thing on another Web page. By identifying all things on the Web using a unique Uniform Resource Locator (URL), we can start building a Web of things. Companies building software for the Web can use this Web of things to answer complex questions like: "What is Manu Sporny's phone number and what does he look like?". In some cases, a vocabulary may not have all of the terms an author needs when describing their thing. The last feature in RDFa 1.1 Lite that some authors might need is the ability to specify more than one vocabulary. For example, if we are describing a Person and we need to specify that they have a favorite animal, we could do something like the following: <p vocab="http://schema.org/" prefix="ov: http://open.vocab.org/terms/" resource="#manu" typeof="Person"> My name is <span property="name">Manu Sporny</span> and you can give me a ring via <span property="telephone">1-800-555-0199</span>. <img property="image" src="http://manu.sporny.org/images/manu.png" / > My favorite animal is the <span property="ov:preferredAnimal">Liger</span>. </p> The example assigns a short-hand prefix to the Open Vocabulary (ov) and uses that prefix to specify the preferredAnimal vocabulary term. Since schema.org doesn't have a clear way of expressing a favorite animal, the author instead depends on this alternate vocabulary to get the job done. 18 1162013 RDFa 1.1 Lite also pre-defines a number of useful and popular prefixes, such as dc, foaf, and schema. This ensures that even if authors forget to declare the popular prefixes, that their structured data will continue to work. A full list of pre-declared prefixes can be found in the initial context document for RDFa 1.1. 19