Lecture notes for INFO116 Lecture 3: Adding Semantics to data

advertisement
1162013
Lecture notes for INFO116
Lecture 3: Adding Semantics to data
Dr. Csaba Veres,
Institute for Information and Media Science,
The University of Bergen
In order to make any data (temperature, location, etc.), Web resource
(image, audio file, video stream, etc.), or real-world entity (painting, IP
camera, can of milk, airline reservation, etc.), collectively abbreviated as
objects, machine understandable, we need to describe their characteristics/
associated information in a machine processable manner. This requires
determining what aspects of the objects to abstract and describe, the semantic
modeling problem, and how to materialize these descriptions in a machine
processable form that can elicit semantic comprehension and facilitate
interoperability, the representation and reasoning problem.
Different forms of data and their semantics
• Unstructured data
⁃
Grammatical text
1
1162013
Grammatical text is exemplified by articles appearing in newspapers (such
as those in The Wall Street Journal, or The New York Times), research
journals (such as those in PubMed), magazines, and books. The text is
characterized by the use of full sentences satisfying grammatical constraints.
Automatic abstraction and assimilation of content can benefit from linguistic
(syntactic and semantic) structure as well as domain-semantics provided by
background knowledge (e.g., in bioinformatics/medical, economics, and
technology).
There is no technology that can “understand” text at a level of a human
reader, so the semantics which can be extracted from unstructured data
involves named entity extraction, extraction of relationships, and
“understanding” a restricted set of inputs (e.g. Siri). Information extraction
from grammatical text often makes use of syntactic patterns and contexts to
recognize entity types (such as person, place, currency, and date) and
relationships.
For example, from a passage like:
(1) The fourth Wells account moving to another agency is the packaged
paper-products division of Georgia-Pacific Corp., which arrived at Wells only
last fall. Like Hertz and the History Channel, it is also leaving for an
Omnicom-owned agency, the BBDO South unit of BBDO Worldwide.
BBDO South in Atlanta, which handles corporate advertising for GeorgiaPacific, will assume additional duties for brands like Angel Soft toilet tissue
and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia- Pacific
in Atlanta.
Extract a table like:
2
1162013
(From NLTK toolkit documentation http://nltk.org/book/ch07.html)
Information Extraction Architecture
A lot of academic work on text analysis has now culminated in the
development of UIMA (Unstructured Information Management
Architecture) that provides a common framework for content analytics
(specifically, processing information to extract meaning and create structured
data).
3
1162013
Unstructured Information Management applications are software systems
that analyze large volumes of unstructured information in order to discover
knowledge that is relevant to an end user. An example UIM application might
ingest plain text and identify entities, such as persons, places, organizations;
or relations, such as works-for or located-at.
UIMA enables applications to be decomposed into components, for
example "language identification" => "language specific segmentation" =>
"sentence boundary detection" => "entity detection (person/place names etc.)".
Each component implements interfaces defined by the framework and
provides self-describing metadata via XML descriptor files. The framework
manages these components and the data flow between them. Components are
written in Java or C++; the data that flows between components is designed
for efficient mapping between these languages.
UIMA additionally provides capabilities to wrap components as network
services, and can scale to very large volumes by replicating processing
pipelines over a cluster of networked nodes.
Apache UIMA is an Apache-licensed open source implementation of the
UIMA specification (that specification is, in turn, being developed
concurrently by a technical committee within OASIS , a standards
organization).
http://uima.apache.org
⁃
Application specific user generated content
User-generated content is exemplified by citizen microblogs (such as from
Twitter) and exchanges on social media (such as on Facebook). (Even though
blogs fall under UGC, their structure is closer to grammatical text than the
4
1162013
text found on Twitter and Facebook, so they are excluded.) The UGC text is
characterized by being terse, colloquial, context-poor, and containing words/
phrases with nonstandard/creative, homophone-based spellings. Automatic
assimilation of content requires background knowledge (in the form of
ontologies) to deal with implicit spatio-temporal-thematic context, and novel
techniques and tools to summarize and visualize aggregated content to deal
with its large size and real-time nature. Robust, scalable, and high-quality
analysis of user-generated content is an active area of contemporary research.
E.g. Sentiment analysis
http://www.sentiment140.com
http://sentiwordnet.isti.cnr.it
• Semi structured data
Semi-structured data are normally text data that contain tags or other
markers to delimit and capture hierarchical and aggregation structure
associated with semantic fields within the data. HTML documents1 and
XML documents with domain-specific tags (such as DBLP XML records)
are classic examples of semi-structured data.
Typically, an XML document consists of tagged text (that is, markup
inter-twined with text) in which the tag makes explicit the category and the
properties of the enclosed text using attribute-value pairs. Even though the
5
1162013
tags in semi-structured data are syntactic in nature, they are conducive to
incorporating semantics in that the tags can be chosen from a standard
semantic model or potentially mapped to one. For example, the tags can be
display- oriented providing information such as color and font, or semanticsoriented providing information such as type and normal-form. Semantic tags
permit abstraction of meaning in a standard form. In summary, semistructured documents can contain descriptive text that is human
comprehensible and semantic tags that are machine processable.
• Structured data
Structured data have well-defined formal syntax and an associated data
model. Relational databases, machine sensor data stream (such as from
sensors on a weather station), RDF triples, and XML serialization of a Webservice call (for data interchange) all illustrate the variety in structured data.
By definition, structured data can be parsed with relative ease. To provide
semantics, we need to associate real-world objects/entities and their
relationships with the structured data on the basis of their schema. To
comprehend machine sensor data, it is also necessary to incorporate spatiotemporal-thematic context using information about time of measurement,
location, and type of sensor, physical quantity measured, unit of measure, etc.
• Multimedia data
We lump all other non-textual content together as multimedia data, which
includes photos, videos, and audio clips found on social-media sharing
services such as Flickr and YouTube. Due to the difficulties involved in
analyzing multimedia content, the only viable approach to gleaning semantics
is through the analysis of associated textual descriptors (metadata) and
community comments.
Techniques for adding semantics to data
6
1162013
• MICROFORMATS
A microformat is an approach to semantic markup that seeks to reuse
existing HTML/XHTML tags to incorporate metadata. Specifically, it
embeds and encodes semantics within the attributes of markup tags. This
approach allows contact information, geographic coordinates, and calendar
events, etc., to be both human readable and machine processable (using
hCard, geo, and hCalendar microformats respectively) as illustrated below.
The text:“Dayton, OH is located at 39.59, -84.22.” can be annotated using
geo-microformat as:
Dayton , OH is located at <span class ="geo">
<span class =" latitude ">39.59</span >,
<span class="longitude">−84.22</span> </ span >
The HTML fragment
<div >
<div >Amit Sheth </ div >
<div>Kno.e.sis Center</div> <div>937−775−5217</div> <a href="http ://
knoesis .org/">http :// knoesis .org/</a>
</ div >
can be semantically enhanced using the hCard microformat markup as
<div class=card">
<div class="fn">Amit Sheth</div>
<div class="org">Kno.e.sis Center</div>
<div class =" tel ">937−775−5217</div >
<a class="url" href="http://knoesis.org/">http://knoesis.org/</a>
</ div >
The class attribute values fn, org, tel, and url stand for formatted name,
organization, telephone number, and Uniform Resource Locator respectively.
hCard is an HTML version of the vCard standard.
Microformats is a commuity driven effort to extend semantics to well
specified, precise domains, e.g.
⁃
hCalendar - events
7
1162013
⁃
hCard - people, organizations, contacts
⁃
rel-license - licensed content
⁃
rel-nofollow - links in untrusted 3rd party content
⁃
rel-tag - tag posts and pages by subject
⁃
XFN - social relationships and rel-me links among profiles for the same
person
⁃
XMDP - define a microformat vocabulary / profile
⁃
XOXO - outlines
The microformats principles
⁃
Solve a specific problem
⁃
Start as simple as possible
⁃
Design for humans first, machines second
⁃
Reuse building blocks from widely adopted standards
⁃
Modularity / embeddability
⁃
Enable and encourage decentralized development, content, services
•
RDFA
RDFa (Resource Description Framework - in - attributes) adds a set of
attribute-level extensions to XHTML for embedding rich metadata within
Web documents. RDFa is a W3C-proposed standard and a markup language
that enables the layering of RDF information on any XHTML or XML
document. RDFa provides a set of attributes that can represent semantic
metadata within an XML language from which RDF triples can be extracted
using simple mapping.
Use of RDFa improves traceability and minimizes duplication of
information in comparison with an approach that maintains a separate
translation/abstraction of a document into RDF.
8
1162013
The core subset of RDFa attributes include
• about—a URI extracted as the subject of an RDF triple that specifies
the resource the metadata is about;
• rel and rev—extracted as the object property (predicate) of an RDF
triple, this URI specifies a relationship or reverse-relationship with another
resource;
• href, src, and resource—extracted as the object of an RDF triple, this
URI specifies the partner resource;
• property—extracted as the datatype property (predicate) of an RDF
triple, this URI specifies a property for the content of an element; and
instanceof—extracted as the object property “rdf:type” coupled with an
RDF triple’s object, this optional attribute specifies the RDF type of the
subject (the resource that the metadata is about).
One of the biggest users of RDFa is Facebook, with Open Graph Protocol
9
1162013
(OGP). The Open Graph protocol enables any web page to become a rich
object in a social graph by specifying metadata terms such as title, type,
image, and URL
<div prefix="og: http://ogp.me/ns# dc: http://purl.org/dc/elements/1.1/"
about="/photos">
<h2 property="og: title">Hubble Space Telescope ’s Top Images</h2>
Sombrero Galaxy (M104) is an edge−on spiral galaxy , 50,000 light−years
across
and 28 million light years from Earth.
<div about="http://example.com/photos/sombrero.jpg">
<img src="http://example.com/photos/sombrero.jpg" />
<span property="dc:title">The Majestic Sombrero Galaxy (M104) </span
>
<span property ="dc:creator ">Hubble Space Telescope
</span >
</div > </div >
The above RDFa example shows use of Open Graph Protocol and Dublin
Core vocabularies (including an association of a title and a creator to an
image). The Dublin Core specifies metadata terms to describe Web and
physical resources such as title, creator, publisher, and date. The resulting
extracted RDF triples are shown below.
< h t t p : / / e x a m p l e . com / p h o t o s >
<http://ogp.me/ns#title > "Hubble Space Telescope ’s Top Images"
< h t t p : / / e x a m p l e . com / p h o t o s / s o m b r e r o . j p g >
<http://purl.org/dc/elements/1.1/title >"The Majestic Sombrero Galaxy
(M104)"
< h t t p : / / e x a m p l e . com / p h o t o s / s o m b r e r o . j p g > <http://
purl.org/dc/elements/1.1/creator> "Hubble Space Telescope"
http://www.w3.org/2012/pyRdfa/#distill_by_input :
10
1162013
@prefix dc11: <http://purl.org/dc/elements/1.1/> .
@prefix og: <http://ogp.me/ns#> .
</photos> og:title "Hubble Space Telescope ’s Top Images" .
<http://example.com/photos/sombrero.jpg> dc11:creator """Hubble Space
Telescope
""";
dc11:title "The Majestic Sombrero Galaxy (M104) " .
RDFa can be quite complicated because it is relatively expressive. As a
result, a competing proposal Microdata has emerged.
•
MICRODATA
Microdata is a proposed HTML5 specification used to embed semantics
within Web documents. It is an attempt to provide a simpler alternative to
approaches such as RDFa and Microformats for annotating HTML elements,
to enable applications such as search engines and Web crawlers to better
assimilate web page content. It is an essential feature of the schema.org effort.
<section itemscope itemtype="http://example.org/animals#cat">
<h1 itemprop="name http :// example .com/fn">Hedral </h1>
<p itemprop="desc">Hedral is a male american domestic shorthair , with a
fluffy
<span itemprop="http://example.com/color">black</span> fur with
<span itemprop="http://example.com/color">white</span> paws and belly.
</p> <img itemprop="img" src="hedral.jpeg" alt="" title="Hedral, age 18
months">
</section >
schema.org markup with Microdata
11
1162013
Let's start with a concrete example. Imagine you have a page about the
movie Avatar—a page with a link to a movie trailer, information about the
director, and so on. Your HTML code might look something like this:
<div>
<h1>Avatar</h1>
<span>Director: James Cameron (born August 16, 1954)</span>
<span>Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>
To begin, identify the section of the page that is "about" the movie Avatar.
To do this, add the itemscope element to the HTML tag that encloses
information about the item, like this:
<div itemscope>
<h1>Avatar</h1>
<span>Director: James Cameron (born August 16, 1954) </span>
<span>Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</Div>
By adding itemscope, you are specifying that the HTML contained in the
<div>...</div> block is about a particular item.
But it's not all that helpful to specify that there is an item being discussed
without specifying what kind of an item it is. You can specify the type of item
using the itemtype attribute immediately after the itemscope.
<div itemscope itemtype="http://schema.org/Movie">
<h1>Avatar</h1>
<span>Director: James Cameron (born August 16, 1954)</span>
<span>Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>
12
1162013
This specifies that the item contained in the div is in fact a Movie, as
defined in the schema.org type hierarchy. Item types are provided as URLs,
in this case http://schema.org/Movie.
What additional information can we give search engines about the movie
Avatar? Movies have interesting properties such as actors, director, ratings.
To label properties of an item, use the itemprop attribute. For example, to
identify the director of a movie, add itemprop="director" to the element
enclosing the director's name. (There's a full list of all the properties you can
associate with a movie at http://schema.org/Movie.)
<div itemscope itemtype ="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<span>Director: <span itemprop="director">James Cameron</span>
(born August 16, 1954)</span>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html"
itemprop="trailer">Trailer</a>
</div>
Note that we have added additional <span>...</span> tags to attach the
itemprop attributes to the appropriate text on the page. <span> tags don't
change the way pages are rendered by a web browser, so they are a
convenient HTML element to use with itemprop.
Search engines can now understand not just that http://
www.avatarmovie.com is a URL, but also that it's the URL for the trailer for
the science-fiction movie Avatar, which was directed by James Cameron.
Sometimes the value of an item property can itself be another item with its
own set of properties. For example, we can specify that the director of the
movie is an item of type Person and the Person has the properties name and
birthDate. To specify that the value of a property is another item, you begin a
new itemscope immediately after the corresponding itemprop.
13
1162013
<div itemscope itemtype ="http://schema.org/Movie">
<h1 itemprop="name"&g;Avatar</h1>
<div itemprop="director" itemscope itemtype="http://schema.org/
Person">
Director: <span itemprop="name">James Cameron</span> (born <span
itemprop="birthDate">August 16, 1954)</span>
</div>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html"
itemprop="trailer">Trailer</a>
</div>
Microdata has had a difficult lifecycle in some circles, as it was possibly ill
conceived as a competitor to the already established RDFa standard.
http://manu.sporny.org/2013/microdata-downward-spiral/ :
“A number of observers have been surprised by these events, but for those
that have been involved in the month-to-month conversation around
Microdata, it makes complete sense. Microdata doesn’t have an active
community supporting it. It never really did. For a Web specification to be
successful, it needs an active community around it that is willing to do the
hard work of building and maintaining the technology. RDFa has that in
spades, Microdata does not.
Microdata was, primarily, a shot across the bow at RDFa. The warning
worked because the RDFa community reacted by creating RDFa Lite, which
matches Microdata feature-for-feature, while also supporting things that
Microdata is incapable of doing. The existence of RDFa Lite left the HTML
Working Group in an awkward position. Publishing two specifications that
did the exact same thing in almost the exact same way is a position that no
14
1162013
standards organization wants to be in. At that point, it became a race to see
which community could create the developer tools and support web
developers that were marking up pages.
Microdata, to this day, still doesn’t have a specification editor, an active
community, a solid test suite, or any of the other things that are necessary to
become a world class technology. To be clear, I’m not saying Microdata is
dying (4 million out of 329 million domains use it), just that not having these
basic things in place will be very problematic for the future of Microdata.”
Manu has always maintained that the Microdata effort, together with
schema.org is an effort to control web vocabulary, especially since Facebook
already uses RDFa.
http://manu.sporny.org/2011/false-choice/
•
RDFA LITE
http://www.w3.org/TR/rdfa-lite/
RDFa Lite is a minimal subset of RDFa, the Resource Description
Framework in attributes, consisting of a few attributes that may be used to
express machine-readable data in Web documents like HTML, SVG, and
XML. While it is not a complete solution for advanced data markup tasks, it
does work for most day-to-day needs and can be learned by most Web
authors in a day.
The full RDFa syntax [RDFA-CORE] provides a number of basic and
advanced features that enable authors to express fairly complex structured
data, such as relationships among people, places, and events in an HTML or
XML document. Some of these advanced features may make it difficult for
authors, who may not be experts in structured data, to use RDFa. This
lighter version of RDFa is a gentler introduction to the world of structured
data, intended for authors that want to express fairly simple data in their web
15
1162013
pages. The goal is to provide a minimal subset that is easy to learn and will
work for 80% of authors doing simple data markup.
RDFa Lite consists of five simple attributes; vocab, typeof, property,
resource, and prefix. RDFa 1.1 Lite is completely upwards compatible with
the full set of RDFa 1.1 attributes. This means that if an author finds that
RDFa Lite isn't powerful enough, transitioning to the full version of RDFa is
just a matter of adding the more powerful RDFa attributes into the existing
RDFa Lite markup.
So, if you wanted to talk about People, the vocabulary that you would use
would specify terms like name and telephone number. When we want to mark
up things on the Web, we need to do something very similar, which is specify
which vocabulary that we are going to be using. Here is a simple example
that specifies a vocabulary that we intend to use to markup things in the
paragraph:
<p vocab="http://schema.org/">
My name is Manu Sporny and you can give me a ring via
1-800-555-0199.
</P>
In this example we have specified that we are going to be using the
vocabulary that can be found at http://schema.org/. Once we have specified
the vocabulary, we need to specify the type of the thing that we're talking
about. In this particular case we are talking about a Person, which can be
marked up like so:
<p vocab="http://schema.org/" typeof="Person">
My name is Manu Sporny and you can give me a ring via
1-800-555-0199.
</P>
Now all we need to do is specify which properties of that person we want
16
1162013
to point out to the search engine. In the following example, we mark up the
person's name, phone number and web page. Both text and URLs can be
marked up with RDFa Lite. In the following example, pay particular
attention to the types of data that are being pointed out to the search engine:
<p vocab="http://schema.org/" typeof="Person">
My name is
<span property="name">Manu Sporny</span>
and you can give me a ring via
<span property="telephone">1-800-555-0199</span>
or visit
<a property="url" href="http://manu.sporny.org/">my homepage</a>.
</P>
Now, when somebody types in “phone number for Manu Sporny” into a
search engine, the search engine can more reliably answer the question
directly, or point the person searching to a more relevant Web page.
Thus RDFa Lite has the same expressivity as Microdata, but has
additional desirable properties such as the ability to talk about resources and
define prefixes.
If you want Web authors to be able to talk about each thing on your page,
you need to create an identifier for each of these things. Just like we create
identifiers for parts of a page using the id attribute in HTML, you can create
identifiers for things described on a page using the resource attribute:
<p vocab="http://schema.org/" resource="#manu" typeof="Person">
My name is
<span property="name">Manu Sporny</span>
and you can give me a ring via
<span property="telephone">1-800-555-0199</span>.
<img property="image" src="http://manu.sporny.org/images/manu.png" /
>
17
1162013
</P>
If we assume that the markup above can be found at http://example.org/
people, then the identifier for the thing is the address, plus the value in the
resource attribute. Therefore, the identifier for the thing on the page would
be: http://example.org/people#manu. This identifier is also useful if you want
to talk about that same thing on another Web page. By identifying all things
on the Web using a unique Uniform Resource Locator (URL), we can start
building a Web of things. Companies building software for the Web can use
this Web of things to answer complex questions like: "What is Manu Sporny's
phone number and what does he look like?".
In some cases, a vocabulary may not have all of the terms an author needs
when describing their thing. The last feature in RDFa 1.1 Lite that some
authors might need is the ability to specify more than one vocabulary. For
example, if we are describing a Person and we need to specify that they have
a favorite animal, we could do something like the following:
<p vocab="http://schema.org/" prefix="ov: http://open.vocab.org/terms/"
resource="#manu" typeof="Person">
My name is
<span property="name">Manu Sporny</span>
and you can give me a ring via
<span property="telephone">1-800-555-0199</span>.
<img property="image" src="http://manu.sporny.org/images/manu.png" /
>
My favorite animal is the <span
property="ov:preferredAnimal">Liger</span>.
</p>
The example assigns a short-hand prefix to the Open Vocabulary (ov) and
uses that prefix to specify the preferredAnimal vocabulary term. Since
schema.org doesn't have a clear way of expressing a favorite animal, the
author instead depends on this alternate vocabulary to get the job done.
18
1162013
RDFa 1.1 Lite also pre-defines a number of useful and popular prefixes,
such as dc, foaf, and schema. This ensures that even if authors forget to
declare the popular prefixes, that their structured data will continue to work.
A full list of pre-declared prefixes can be found in the initial context
document for RDFa 1.1.
19
Download