Harnessing Social Streams Professor Fabio Ciravegna Department of Computer Science, University of Sheffield

© Fabio Ciravegna, University of Sheffield
Harnessing Social Streams
Professor Fabio Ciravegna
Department of Computer Science,
University of Sheffield
fabio@dcs.shef.ac.uk
Picture from http://thesocietypages.org/sociologylens/tag/nathan-jurgenson/
Wednesday, 19 September 12
Web 2.0
[Timeline figure, 1991 to 2009, marking among others: Wikipedia (Jan 2001), Apple’s iTunes (April 2003), Facebook (2004) and YouTube (2005)]
Apple’s iTunes: bound to become the largest store in the world for digital music (175 million songs in first quarter of 2007)
Source: BBC
Web 2.0: Trends
• From connecting documents to connecting people
• User trends
• Harnessing collective intelligence
• Trusting users as co-developers
• Leveraging the long tail through customer self-service
http://oreilly.com/web2/archive/what-is-web-20.html
http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
Collective Intelligence
• Collective intelligence is a shared or group intelligence that emerges from the collaboration and competition of many individuals
• Collective intelligence appears in a wide variety of forms of consensus decision making in bacteria, animals, humans, and computer networks
• According to Don Tapscott and Anthony D. Williams, collective intelligence is mass collaboration.
  • For it to happen, four principles need to exist: openness, peering, sharing and acting globally.
Definitions from Wikipedia
Harnessing collective intelligence
• Hyperlinking is the foundation of the web.
  • As users add new content, and new sites, it is bound in to the structure of the web by other users discovering the content and linking to it.
  • Much as synapses form in the brain, with associations becoming stronger through repetition or intensity, the web of connections grows organically as an output of the collective activity of all web users.
• Google's breakthrough in search was PageRank
  • A method of using the link structure of the web rather than just the characteristics of documents to provide better search results.
  • What is popular is important
    • Many hyperlinks reaching the page
    • Important hyperlinks reaching the page
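The two PageRank intuitions (many hyperlinks reaching a page, and important hyperlinks reaching a page) can be illustrated with a small power-iteration sketch. This is a textbook simplification, not Google's production algorithm; the damping factor of 0.85, the iteration count and the toy link graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" has two incoming links, "d" has one link from "c".
toy_web = {"a": ["c"], "b": ["c"], "c": ["d"], "d": [], "e": ["a"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph "c" outranks "a" because more links reach it, while "d" outranks "a" although each has a single incoming link, because the link into "d" comes from a more important page.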
Harnessing collective intelligence
Let people tell you what they know/what they like
• Social bookmarking
• StumbleUpon
Harnessing Collective Intelligence: Wikipedia
• Wikipedia, an online encyclopaedia based on the unlikely notion that an entry can be added by any web user, and edited by any other
  • A radical experiment in trust
  • Eric Raymond's dictum (originally coined in the context of open source software): "with enough eyeballs, all bugs are shallow"
• Wikipedia is already in the top 10 websites. This is a profound change in the dynamics of content creation!
In principle: yet another encyclopaedia
But you can edit the content!
The collective intelligence kicks in
Discussion
The history to track changes (trust the users but...)
Harnessing: The Mechanical Turk (amazon.com)
• A web service that Amazon is calling 'artificial artificial intelligence.'
• If you need a process completed that only humans can do given current technology (judgment calls, text drafting or editing, etc.),
  • you can simply make a request to the service to complete the process.
• The machine will then complete the task with volunteers, and return the results to your software.
https://www.mturk.com/mturk/welcome
Image from wikipedia
Other examples
• Yahoo Answers
Collective Intelligence: Openness
• During the early ages of communications technology, people and companies were reluctant to share ideas and intellectual property, or to encourage self-motivation
  • The reason is that these resources provided the edge over competitors
• Now people and companies tend to loosen their hold over these resources because they reap more benefits in doing so
  • By allowing others to share ideas and bid for franchising, their products are able to gain significant improvement and scrutiny through collaboration
Collective Intelligence: Peering
• This is a form of horizontal organisation with the capacity to create information technology and physical products
  • The ‘opening up’ of the Linux program, where users are free to modify and develop it provided that they make it available to others.
  • Participants in this form of collective intelligence have different motivations for contributing,
  • but the results achieved are for the improvement of a product or service
• “Peering succeeds because it leverages self-organisation, a style of production that works more effectively than hierarchical management for certain tasks”
Collective Intelligence: Sharing
• More and more companies have started to share some intellectual property, while maintaining some degree of control over others, like potential and critical patent rights
  • This is because companies have realised that by limiting all their intellectual property, they are shutting out all possible opportunities
  • Sharing some has allowed them to expand their market and bring products out more quickly
• Google with their free services and Apple’s App Store are examples of opening up part of the IP so as to allow (nearly) everyone to contribute
Collective Intelligence: Acting Globally
• The emergence of communication technology has prompted the rise of global companies, or e-Commerce
  • E-Commerce has allowed individuals to set up businesses at almost no or low overhead costs
  • As the influence of the Internet is widespread, a globally integrated company would have no geographical boundaries
    • Music industry is ubiquitous, you can sell from everywhere
    • Selling a dishwasher is more complex
  • They would also have global connections, allowing them to gain access to new markets, ideas and technology
5 Great Ways of Harnessing Collective Intelligence
Dion Hinchcliffe
http://www.kreeo.com/#cbok/Collective_Intelligence/contents
• Be The Hub of A Hard To Recreate Data Source
  • This is a classic Web 2.0 concept and success here often devolves to being the first entry with an above average implementation.
  • Wikipedia, eBay, and others which are almost entirely the sum of the content their users contribute.
  • Just be careful and avoid crowded niches, like peer production news.
The Wikipedia content: from competitive advantage to common heritage
5 Great Ways of Harnessing Collective Intelligence
Dion Hinchcliffe
http://www.kreeo.com/#cbok/Collective_Intelligence/contents
• Seek Collective Intelligence Out
  • Google uses hyperlink analysis to determine the relevance of a given page and builds its own database of content which it then shares through its search engine.
  • Not only does this approach completely avoid a dependency on the ongoing kindness of strangers, it also lets you build a very big content base from the outset.
Do you think that this ad will work? (note this is a spoof ad)
A personal appeal from Microsoft founder Bill Gates: Help Bing! index pages
5 ways (ctd)
• Trigger Large-Scale Network Effects
  • This is arguably harder to do than either of the methods above
  • Small examples can be found in things like the Million Dollar Pixel Page.
  • Probably not very repeatable, but when they happen, they can happen big.
SMS, Web, Email, Radio,
Phone, Twitter, Facebook,
Television, List-serves, Live
streams, Situation Reports
http://haiti.ushahidi.com/
5 ways (ctd)
• Create a Reverse Intelligence Filter
  • The idea is that hyperlinks, trackbacks, and other information references can be counted and used as a reference to determine what is important.
  • The blogosphere is the greatest example of this
  • Combine this with temporal filters and other techniques and you can create situation awareness engines easily.
  • It sounds similar to #2 but it's different in that it can be used with or without external data sources and is aimed not at finding but at eliding the irrelevant altogether as an active filter.
Trusting users as co-developers
• Let people play with your data
• Let people play with your software
Involving users as co-developers:
• Amazon sells the same products as competitors such as Barnesandnoble.com
  • They receive the same product descriptions, cover images, and editorial content from their vendors.
• What is the difference, then?
  • Barnesandnoble.com search is likely to lead with the company's own products, or sponsored results
  • Amazon has made a science of user engagement.
    • They have an order of magnitude more user reviews, invitations to participate in varied ways on virtually every page
    • They use user activity to produce better search results.
  • Amazon always leads with "most popular", a real-time computation based on
    • Sales
    • "Flow" around products
Involving users as developers: amazon.co.uk
amazon.co.uk as Web 2.0 and as old shop keeper
[Annotated screenshot of a product page; the annotations read:]
• Instruction to publishers to provide additional service (advantage to both: cooperation to sell more)
• details about the service and offers (standard e-commerce services)
• more details about the product (standard service)
• cluster interests based on past experience (like the old shop keeper)
Customer for customers
[Annotated screenshot; the annotations read:]
• Look Inside like in a shop
• the judgment is shared by the users
• You can judge by yourself about the quality by the comment (you may disagree)
• the community helps judge the quality of the comment
• Comments on the comments
• Comments are published positive or negative (or so it seems!)
Amazon.com: Personalised Services
• This is my space
• I can attach my wishes to it
• my friends can see it and send me a present
• a mediator between people
• like the old shop keepers who know the wishes of
their customers
Supporting community life
• Let people upload their data and share it
• Creating, supporting and free communities
The Social Web
• The social web can be described as people interlinked and interacting with engaging content in a conversational and participatory manner via the WWW
  • people are brought together through a variety of shared interests.
• Since social web applications are built to encourage communication between people, they typically emphasise some combination of the following social attributes:
  • Identity: who are you?
  • Reputation: what do people think you stand for?
  • Presence: where are you?
  • Relationships: who are you connected with? who do you trust?
  • Groups: how do you organize your connections?
  • Conversations: what do you discuss with others?
  • Sharing: what content do you make available for others to interact with?
http://en.wikipedia.org/wiki/Social_Web
Or in short...
The social web (ctd)
• Two types of Web sites:
  • "people focus": the person is the focus of social interaction
    • A profile is constructed by each user
    • e.g. Bebo, Facebook, and Myspace.
  • "hobby focus"
    • e.g. photography websites such as Flickr, Kodak Gallery and Photobucket
    • Pinterest is the new hobby-related site
[Chart: ages of social network users, 2010]
http://royal.pingdom.com/2010/02/16/study-ages-of-social-network-users/
Folksonomy
From Wikipedia
• A "folksonomy" is a collaboratively generated, open-ended labeling system that enables Internet users to categorize content such as Web pages, online photographs, and Web links.
  • The freely chosen labels, called tags, help improve a search engine's effectiveness because content is categorized using a familiar, accessible, and shared vocabulary.
  • The labeling process is called tagging.
  • Because folksonomies develop in Internet-mediated social environments, users can discover (generally) who created a given folksonomy tag, and see the other tags that this person created.
  • In this way, folksonomy users often discover the tag sets of another user who tends to interpret and tag content in a way that makes sense to them.
  • The result, often, is an immediate and rewarding gain in the user's capacity to find related content.
Tag Cloud (or Folksonomy)
•Part of the appeal of folksonomy is its inherent subversiveness:
  • Faced with the dreadful performance of the search tools that Web sites typically provide, folksonomies can be seen as a rejection of the search engine status quo in favor of tools that are both created by the community and beneficial to the community.
•Folksonomies arise in Web-based communities where special provisions are made at the site level for creating and using tags.
•These communities are established to enable Web users to label and share user-generated content, such as photographs, or to collaboratively label existing content, such as Web sites, books, works in the scientific and scholarly literatures, and blog entries.
Blogging and the Wisdom of Crowds
•One of the most highly touted features of the Web 2.0 era is the rise of blogging.
•Personal home pages have been around since the early days of the web, and the personal diary and daily opinion column around much longer than that, so just what is the fuss all about?
•At its most basic, a blog is just a personal home page in diary format. But as Rich Skrenta notes, the chronological organization of a blog "seems like a trivial difference, but it drives an entirely different delivery, advertising and value chain."
Typing What I'm Thinking To Everyone Reading
Twitter (2006)
• Twitter is a social networking and micro-blogging service that allows its users to send and read other users' updates (known as tweets),
  • text-based posts of up to 140 characters in length.
  • Updates are displayed on the user's profile page and delivered to other users who have signed up to receive them.
http://en.wikipedia.org/wiki/Twitter
Tweet?
• Twitter now has more than 140 million active users, sending 340 million tweets every day (3900/second on average, peaks of 25,000)
• Twitter users send over a billion tweets every 72 hours
• The average Twitter user has 126 followers
  • The average Twitter user tweets a link at 9am and achieves a 1.17% click-through rate (1.4 followers click the link)
• Over 40% of Twitter users do not tweet anything
• About 0.05% of the total Twitter population attract almost 50% of attention on the channel
• No one:
  • 71% of the millions of tweets each day attract no reaction
  • 25% of Twitter users have no followers
  • 45% of the mass communications posted on Twitter are nonsense
• 8% of Americans use Twitter
Source: 99 New Social Media Stats for 2012
The Long Tail
• Overture and Google's success came from an understanding of
what Chris Anderson refers to as "the long tail," the collective
power of the small sites that make up the bulk of the web's
content.
• Overture and Google figured out how to enable ad placement on virtually any web
page.
• What's more, they eschewed publisher/ad-agency friendly advertising formats such as banner
ads and popups in favor of minimally intrusive, context-sensitive, consumer-friendly text
advertising.
• The Web 2.0 lesson: leverage customer-self service and
algorithmic data management to reach out to the entire web, to
the edges and not just the center, to the long tail and not just the
head.
RSS Feeds
• One of the things that has made a difference is a technology called RSS.
• RSS is the most significant advance in the fundamental architecture of the web since early hackers realized that CGI could be used to create database-backed websites.
• RSS allows someone to link not just to a page, but to subscribe to it, with notification every time that page changes.
• Some call it the "live web".
• "Dynamic websites" (i.e., database-backed sites with dynamically generated content) replaced static web pages well over ten years ago
• What's dynamic about the live web are not just the pages, but the links.
• A link to a weblog is expected to point to a perennially changing page, with "permalinks" for any individual entry, and notification for each change.
• An RSS feed is thus a much stronger link than, say, a bookmark or a link to a single page.
RSS Feeds (2)
•RSS also means that the web browser is not the only means of viewing a web page.
  • While some RSS aggregators, such as Bloglines, are web-based, others are desktop clients, and still others allow users of portable devices to subscribe to constantly updated content.
•RSS is now being used to push not just notices of new blog entries, but also all kinds of data updates, including stock quotes, weather data, and photo availability.
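To make "subscribing to a page" concrete, here is a minimal sketch of what an aggregator does on each poll: parse the feed and report only the entries it has not seen before. The sample feed, its URLs and the function name new_entries are invented for illustration; a real aggregator would fetch the XML over HTTP and handle more feed variants.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example weblog</title>
    <link>http://example.org/</link>
    <item>
      <title>Second post</title>
      <link>http://example.org/2012/09/second-post</link>
      <guid>http://example.org/2012/09/second-post</guid>
    </item>
    <item>
      <title>First post</title>
      <link>http://example.org/2012/09/first-post</link>
      <guid>http://example.org/2012/09/first-post</guid>
    </item>
  </channel>
</rss>"""


def new_entries(feed_xml, seen_guids):
    """Return (title, permalink) pairs for items not seen on earlier polls."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # The guid (or, failing that, the permalink) identifies an entry.
        guid = item.findtext("guid") or item.findtext("link")
        if guid and guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append((item.findtext("title"), item.findtext("link")))
    return fresh


seen = set()
first_poll = new_entries(SAMPLE_FEED, seen)   # both entries are new
second_poll = new_entries(SAMPLE_FEED, seen)  # nothing has changed since
print(first_poll)
print(second_poll)
```

The set of seen identifiers is what turns a plain link into a subscription: the client is notified only when the feed changes.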
Harnessing Collective Intelligence in Texts
Text Analysis
• Two main historic approaches to text analysis
  • Information Retrieval
    • Keyword matching (as in search engines)
    • Issues:
      • Synonyms (beautiful vs attractive)
      • Polysemy (bank vs bank vs bank)
    • This makes extracting information by using keywords interesting and effective, but also a challenge
    • Several open source tools e.g. Solr
  • Natural Language Processing
    • Wide spectrum
      • From named entity recognition (e.g. people’s names)
      • To event extraction
      • To modelling user intentions
Anatomy of Search Engines: indexing
[Diagram with the components: pages, Crawler, Indexer, Indices, Ranking Algorithm, Ranked Documents]
Indexing
•Crawler
  •Discovers pages over the web
  •Sends pages to Indexer
•Indexer (next slides)
  •Extracts keywords and creates tables of indices
•Ranker (next slides)
  •Decides an order of importance/relevance among pages
Indexing via Inverted Files
•Extract words from each document
A picture by M. Lapata
Extracting words
•Tokenization: separation of words
  •From a document (one big string)
    • “I had my porridge hot”
  •To a sequence of strings
    • “I” “had” “my” “porridge” “hot”
•Stop word filtering
  •Words with a very high frequency but low semantic content (meaning) are not considered
    • the, a, an, on, in, for, of, with, without, etc.
    • “had” “porridge” “hot”
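The two steps can be sketched in a few lines. The regular expression used for tokenization is a deliberately crude assumption, and the stop word list extends the one on the slide with "i" and "my" (covered there only by "etc.") so that the porridge example gives the same result.

```python
import re

# Stop words from the slide, plus "i" and "my" (an assumption, see above).
STOP_WORDS = {"the", "a", "an", "on", "in", "for", "of", "with", "without", "i", "my"}


def tokenize(text):
    """Split one big string into a sequence of word strings."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stop_words(tokens):
    """Drop high-frequency words that carry little meaning."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = tokenize("I had my porridge hot")
content_words = remove_stop_words(tokens)
print(tokens)         # ['I', 'had', 'my', 'porridge', 'hot']
print(content_words)  # ['had', 'porridge', 'hot']
```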
Processing
Indexing via Inverted Files (2)
•Create an Index for each word
  • Keeping track of position of words in files
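A positional inverted index can be sketched as a dictionary from each word to the documents that contain it and the positions where it occurs. The two sample documents and their identifiers are made up for the example, and whitespace splitting stands in for proper tokenization.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def lookup(index, word):
    """Return the postings for a word as a plain dict (empty if the word is absent)."""
    return {doc_id: list(positions) for doc_id, positions in index.get(word, {}).items()}


# Hypothetical document collection.
docs = {
    "doc1": "pease porridge hot pease porridge cold",
    "doc2": "pease porridge in the pot nine days old",
}
index = build_inverted_index(docs)
print(lookup(index, "porridge"))  # {'doc1': [1, 4], 'doc2': [1]}
print(lookup(index, "cold"))      # {'doc1': [5]}
```

Keeping positions, rather than just document identifiers, is what later allows phrase queries such as "porridge hot" to be answered from the index alone.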
Towards Semantic Analysis
• An advanced version of Semantic Analysis (or NLP) is to look for special types of keywords, e.g. Named Entities or domain specific terms (terminology in aerospace)
• Identifies small scalable structures (proper names, dates, numbers, currencies, etc.)
  • Most classes fixed (e.g. People, Locations)
NE Recognition & Coreference
19:16 Moody's rates Province of Saskatchewan A3
Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today.
The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for government purposes, mainly Saskatchewan Power Corp.
[Annotation labels in the figure: Organisation, MNY, %, Date, Coreference]
NYU: NE Recognition (1994)
•Gazetteer lookup
• Patterns:
    Person -> FirstName + Word.Capitalised
    Person -> Person + Word.Capitalised
    Company -> Word.Capitalised+ <company-indicator>
• After Name Recognition:
    [name type: Person Sam Schwarz] retired as executive vice president of the famous hot dog manufacturer, [name type: Company Hupplewhite Inc.] He will be succeeded by [name type: Person Harry Himmelfarb].
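The gazetteer-plus-patterns idea can be sketched as follows. This is an illustration of the three patterns above, not the NYU system itself; the tiny first-name gazetteer, the list of company indicators and the function names are assumptions made for the example.

```python
FIRST_NAMES = {"Sam", "Harry"}                   # hypothetical gazetteer of first names
COMPANY_INDICATORS = {"Inc.", "Corp.", "Ltd."}   # hypothetical company indicators


def is_capitalised(token):
    return token[:1].isupper()


def recognise_names(text):
    """Return (type, name) pairs found with the Person and Company patterns."""
    tokens = text.split()
    entities = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        # Person -> FirstName + Word.Capitalised, then Person -> Person + Word.Capitalised
        if token in FIRST_NAMES and i + 1 < len(tokens) and is_capitalised(tokens[i + 1]):
            j = i + 1
            while j < len(tokens) and is_capitalised(tokens[j]):
                j += 1
                if tokens[j - 1][-1] in ".,;":
                    break  # punctuation ends the name
            entities.append(("Person", " ".join(tokens[i:j]).rstrip(".,;")))
            i = j
            continue
        # Company -> Word.Capitalised+ <company-indicator>
        if is_capitalised(token):
            j = i
            while j < len(tokens) and is_capitalised(tokens[j]) and tokens[j] not in COMPANY_INDICATORS:
                j += 1
            if j > i and j < len(tokens) and tokens[j] in COMPANY_INDICATORS:
                entities.append(("Company", " ".join(tokens[i:j + 1])))
                i = j + 1
                continue
        i += 1
    return entities


sentence = ("Sam Schwarz retired as executive vice president of the famous hot dog "
            "manufacturer, Hupplewhite Inc. He will be succeeded by Harry Himmelfarb.")
entities = recognise_names(sentence)
print(entities)
```

On the example sentence this finds the two people and the company; anything outside the gazetteer (for instance a person whose first name is not listed) is missed, which is the usual limitation of this kind of rule-based approach.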
Terminology Recognition
• Task of reducing a term to a URI
Collaborative Filtering
Collaborative Filtering
• The process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.
  • Applications of collaborative filtering typically involve very large data sets
• The method of making automatic predictions about the interests of a user by collecting taste information from many users
  • Underlying assumption: those who agreed in the past tend to agree again in the future
  • Predictions are specific to the user using information from many users
  • For example, a collaborative filtering or recommendation system for music tastes could make predictions about which music a user should like given a partial list of that user's tastes (likes or dislikes).
Collaborative Filtering
• Differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes
Approaches to Collaborative Filtering: user centric
• User-centric
  • Given a user u
  • Look for users who share the same rating patterns with u
  • Use their ratings to calculate a prediction for u
• Two types:
  • Active filtering: peer-to-peer based approach
    • peers, coworkers, and people with similar interests rate products, reports, and other material objects, and share this information over the web for other people to see
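The user-centric recipe (find users whose rating patterns resemble those of u, then combine their ratings into a prediction for u) can be sketched as below. Cosine similarity over co-rated items and a similarity-weighted average are one common choice, not the only one; the users, items and ratings are invented for the example.

```python
from math import sqrt

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 5, "item2": 3, "item3": 4, "item4": 5},
    "carol": {"item1": 1, "item2": 5, "item4": 1},
}


def cosine_similarity(a, b):
    """Similarity of two users, computed over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(user, item, ratings):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    weighted_sum = 0.0
    total_similarity = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        total_similarity += similarity
    return weighted_sum / total_similarity if total_similarity else None


prediction = predict("alice", "item4", ratings)
print(round(prediction, 2))
```

Because alice's ratings match bob's exactly and differ from carol's, the prediction for item4 is pulled towards bob's rating of 5 rather than sitting at the plain average of 5 and 1.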
User Centric (2)
• Passive filtering: collects information implicitly
  • Web browser collects user’s preferences by following and measuring their actions
    • E.g. Purchasing an item or not
  • See lecture 4 on collective intelligence for search engines
    • The Bing/Google issue
Approaches to Collaborative Filtering (2)
• Product-centric
  • Build an item-item matrix determining relationships between pairs of items
  • Using the matrix, and the data on the current user, infer their taste
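A minimal sketch of the product-centric approach: count how often two items appear together in the same user's history (the item-item matrix), then score unseen items by how strongly they co-occur with what the current user already has. Co-occurrence counts are the simplest possible relationship measure and are an assumption here, as are the sample baskets.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories.
baskets = {
    "alice": {"book_a", "book_b"},
    "bob":   {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_c"},
    "dave":  {"book_a"},
}


def build_item_item_matrix(baskets):
    """matrix[x][y] = number of users who have both item x and item y."""
    matrix = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for x, y in combinations(sorted(items), 2):
            matrix[x][y] += 1
            matrix[y][x] += 1
    return matrix


def recommend(user_items, matrix):
    """Rank items the user lacks by total co-occurrence with items the user has."""
    scores = defaultdict(int)
    for owned in user_items:
        for other, count in matrix.get(owned, {}).items():
            if other not in user_items:
                scores[other] += count
    return sorted(scores, key=lambda item: (-scores[item], item))


matrix = build_item_item_matrix(baskets)
suggestions = recommend(baskets["dave"], matrix)
print(suggestions)  # ['book_b', 'book_c']
```

The matrix depends only on the items, so it can be computed offline and reused for every user, which is the main practical attraction of this variant.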
Fast forward: today’s lab class
Seeking Collective Intelligence
• We will seek collective intelligence to address situation awareness through social media
• Situation awareness = understanding of situations, e.g. emergencies
• We will in particular address:
  • The use of Twitter
  • The modelling of the context using Linked Open Data
    • You should know about this
  • The use of maps for displaying information
Your Exercise
• Create a program that enables situation awareness
• Identify a situation awareness scenario (e.g. fire) and its associated keywords (e.g. fire brigade)
• Create a Java program that:
  • Queries Twitter and retrieves relevant posts
  • Saves tweets in a database (e.g. MySQL or Google Fusion Tables)
  • Extracts information on names by sending tweets to Zemanta.org
    • E.g. Identify references to where and who
  • Queries DBpedia.org to get further references “around the entities mentioned”
    • Note: you should have a clear plan on what “around” means
    • You will have to design a very small ontology based on the types returned by Zemanta
  • Displays tweets and related concepts in a relevant way, for example using maps for where and related concepts
Example: crashes
• Identify the scenario (e.g. road crashes)
• Query Twitter for relevant messages
• Devise relevant queries and methods to classify relevant messages - for example use regular expressions such as
  • <xxx> crash(ed) … on ...
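One possible reading of the "<xxx> crash(ed) … on ..." pattern is sketched below. The lab asks for Java; Python is used here only to keep the sketch short. The exact expression, the group names and the sample messages are assumptions: the pattern treats the words before "crash(ed)" as what crashed and a capitalised phrase after "on" as the place.

```python
import re

# Hypothetical concrete form of "<xxx> crash(ed) ... on ...".
CRASH_PATTERN = re.compile(
    r"(?P<what>\w[\w\s]*?)\s+crash(?:ed)?\b.*?\bon\s+(?:the\s+)?"
    r"(?P<where>[A-Z]\w*(?:\s+[A-Z]\w*)*)"
)


def classify(message):
    """Return what crashed and where if the message matches, otherwise None."""
    match = CRASH_PATTERN.search(message)
    if match is None:
        return None
    return {"what": match.group("what"), "where": match.group("where")}


first = classify("Lorry crashed into a barrier on the M1 near Sheffield")
second = classify("Car crash on Regent Street, avoid the area")
third = classify("My laptop crashed again")
print(first)   # {'what': 'Lorry', 'where': 'M1'}
print(second)  # {'what': 'Car', 'where': 'Regent Street'}
print(third)   # None
```

A pattern like this is cheap but shallow: "Stock markets crashed on Monday" would also match, so in practice it would be combined with keyword filters or with the entity types returned by the annotation service.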
Examples
Link:
  anchor=Regent Street
  confidence=0.862714
  entity_type=
Target:
  url=http://maps.google.com/maps?ll=51.5108,-0.1387&spn=0.01,0.01&q=51.5108,-0.1387 (Regent%20Street)&t=h
  title=Regent Street
  type=geolocation
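For a geolocation target like the one above, the coordinates needed for a map can be read from the ll parameter of the URL. The sketch below (in Python for brevity, although the lab asks for Java) assumes that the parameter has the form latitude,longitude as in the example; the function name is made up.

```python
from urllib.parse import urlparse, parse_qs


def extract_coordinates(maps_url):
    """Return (latitude, longitude) from the `ll` query parameter, or None."""
    query = parse_qs(urlparse(maps_url).query)
    if "ll" not in query:
        return None
    latitude_text, longitude_text = query["ll"][0].split(",")
    return float(latitude_text), float(longitude_text)


# Target URL copied from the example output above.
url = ("http://maps.google.com/maps?ll=51.5108,-0.1387&spn=0.01,0.01"
       "&q=51.5108,-0.1387 (Regent%20Street)&t=h")
coordinates = extract_coordinates(url)
print(coordinates)  # (51.5108, -0.1387)
```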
Dbpedia.org
• Query DBpedia.org to get relevant points of interest surrounding Regent Street
  • Choose the type of points of interest (e.g. Monuments)
• Display on a map
  • You can use Google Fusion Tables
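One way to make "surrounding" concrete is a bounding box around the coordinates of the entity. The sketch below only builds a SPARQL query string (in Python for brevity, although the lab asks for Java) and does not contact any endpoint. The box size of 0.01 degrees, the choice of dbo:Monument as the type and the function name are assumptions; the query relies on resources carrying geo:lat and geo:long values, which should be checked against the live data.

```python
def build_nearby_query(lat, lon, delta=0.01, rdf_type="dbo:Monument", limit=20):
    """Build a SPARQL query for resources of `rdf_type` inside a box around (lat, lon)."""
    return f"""PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?place ?label ?lat ?long WHERE {{
  ?place a {rdf_type} ;
         rdfs:label ?label ;
         geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat > {lat - delta:.4f} && ?lat < {lat + delta:.4f})
  FILTER (?long > {lon - delta:.4f} && ?long < {lon + delta:.4f})
  FILTER (lang(?label) = "en")
}}
LIMIT {limit}"""


# Coordinates of Regent Street taken from the annotation example.
query = build_nearby_query(51.5108, -0.1387)
print(query)
```

Stating the box size explicitly is one way of having "a clear plan on what around means"; a radius in metres or a named containing area would be equally valid plans.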
The whole point is...
• The point is not to implement everything today
• The point is to draw a plan and show the details on how you would do it
  • So to show you understood and know how to use the tools
• Required:
  • Architecture, Twitter query, DBpedia query
  • Some implementation of the workflow
• (At least partial) implementation (today or in the future) is required
Situation awareness via social streams
Prof. Dr. Fabio Ciravegna,
Director of Research and Innovation in the Digital World
University of Sheffield, United Kingdom
and
Department of Computer Science
University of Sheffield, United Kingdom
f.ciravegna@shef.ac.uk
Common work with Vita Lanfranchi, Neil Ireson, Annalisa Gentile, Sam Chapman, Paul Ridgeway, Duncan Grant, Ziqi Zhang and Elizabeth Cano Basave
Situation Awareness
• All knowledge that is accessible and can be integrated into a coherent picture to assess and cope with a situation (Sarter 1991)
• Complete and reliable information elusive
Information Needs
Typical questions for SA in Emergency Management:
• Topics discussed
  • Trend analysis
• Event identification
  • who/what/where/when
  • e.g. what areas of Haiti need what type of support? Are there any likely survivors trapped under the rubble?
Social Streams and SA
• Online forums, web sites, RSS feeds, microblogging
  • Ubiquitous, rapid, cross platform medium of communication (Vieweg 2010)
  • Tweeters as sensors (Sakaki 2010)
• They support self-organising behaviour that can produce accurate results often in advance of official communication (Vieweg 2010)
• The Arab Spring uprisings
  • The vast majority of people got their information from social media sites (88% in Egypt and 94% in Tunisia)
  • Facebook usage swelled in the Arab region between January and April and sometimes more than doubled
    • Number of users jumped by 30% to 27.7m, compared with 18% growth during the same period in 2010
  • #1 Twitter hashtag in 2011 was #egypt
Requirements
• Monitoring in real time
  • Identifying relevant messages while getting rid of noise
  • Ascertaining reliability of source or information
  • Relating information to background knowledge
  • Understanding timeliness, geo-reference points
  • Monitoring specific groups/networks
  • Separating local information from external information (Vieweg 2010)
  • Getting messages in languages you understand
• Analysis during the aftermath
  • Storing, searching, bookmarking, trend analysis
A Simplified Task
Monitoring the development of a house fire
[Figure with the timestamps 10:00 am, 10:49 am, 11:00 am and 11:00 am]
Collective Intelligence
Social Streams
• Characteristics:
  • High in volume, and constantly increasing
  • Often duplicated, incomplete, imprecise and incorrect
  • Written in informal style, extensive use of shorthand, symbols (e.g., emoticons), misspellings etc.
  • Generally concerning the short-term zeitgeist
  • Covering every conceivable domain
• Semantic underspecification
  • Your life in 140 characters
  • The context is all important
• Spammers are an everyday reality
The Web seen through Wikipedia
Barack Hussein Obama is the 44th and current President of the United States. He is the first African American to hold the office. Obama served as a U.S. Senator representing the state of Illinois from January 2005 to November 2008, when he resigned following his victory in the 2008 presidential election.
© Fabio Ciravegna, University of Sheffield
The Web through Social Streams
Twitter alone produces the equivalent in text of the 1.4m pages
in Wikipedia every two days
74
Wednesday, 19 September 12
All Examples are Real!
• Bristol Carnival
  • Bristol, UK, July 2011
  • User: emergency service control room (Gold Command)
• Conservative Party Conference
  • Manchester, UK, October 2011
  • User: emergency service control room (Gold Command)
• Forward Defensive Exercise
  • London, February 2012
  • User: undisclosed
• North East Floods, July 2012
  • User: SSSW 2012
Linguistic issues
• Short sentences
• Alternative language
• Use of slang/swear words
• Negatives
• Conditional statements
• Hope/prayer statements
• Use of irony/sarcasm
• Ambiguity
• Unreliable capitalisation
• Data sparsity (Saif 2012)
Sarcasm
• Look for word combinations with opposite polarity, e.g. “rain” or “delay” plus “brilliant”
  • Going to the dentist on my weekend home. Great. I'm totally pumped
• Inclusion of world knowledge / ontologies can help
  • e.g. knowing that people typically don't like going to the dentist,
  • that people typically like weekends better than weekdays
• It's an incredibly hard problem and an area where we expect not to get it right that often
A slide by Diana Maynard
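The opposite-polarity cue on the sarcasm slide can be sketched in a few lines of Java. This is only an illustration: the two word lists and the class name PolarityClash are assumptions made up for the example, and a real system would need a proper sentiment lexicon plus the world knowledge the slide mentions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PolarityClash {
    // Tiny hand-made lexicons: illustrative assumptions, not a real resource.
    static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("brilliant", "great", "pumped", "love"));
    static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("rain", "delay", "dentist", "queue"));

    /** True when the text mixes at least one positive and one negative cue word. */
    static boolean hasPolarityClash(String text) {
        boolean positive = false;
        boolean negative = false;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (POSITIVE.contains(token)) positive = true;
            if (NEGATIVE.contains(token)) negative = true;
        }
        return positive && negative;
    }
}
```

A clash is only a hint that a message may be sarcastic; as the slide says, expect it to be wrong fairly often.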
© Fabio Ciravegna, University of Sheffield
Modelling Streams and Context
Wednesday, 19 September 12
Modelling Streams and Context
• A tuple: <D, C, F>
  • D is a stream of documents d
  • C is a feature vector representing the set of contexts of documents in four dimensions: c = <dWt, dWo, dWn, dWe>
  • F: function associating context to a document
[Figure: F associates documents in D with the four context dimensions dWt, dWo, dWn, dWe]
Lanfranchi 2012
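One way to picture the tuple <D, C, F> in code is shown below. It is a rough sketch under stated assumptions: the class and field names are invented for the example, the context function F is deliberately naive (first hashtag as "what", author as "who", post time as "when", no location), and none of it is taken from Lanfranchi 2012.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamModel {
    /** A document d in the stream D. */
    static final class Document {
        final String author;
        final String text;
        final long postedAtMillis;

        Document(String author, String text, long postedAtMillis) {
            this.author = author;
            this.text = text;
            this.postedAtMillis = postedAtMillis;
        }
    }

    /** Context c = <dWt, dWo, dWn, dWe>: what, who, when, where. */
    static final class Context {
        final String what;
        final String who;
        final long when;
        final String where;

        Context(String what, String who, long when, String where) {
            this.what = what;
            this.who = who;
            this.when = when;
            this.where = where;
        }
    }

    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    /** F: a naive, hypothetical context function used only for illustration. */
    static final Function<Document, Context> F = d -> {
        Matcher m = HASHTAG.matcher(d.text);
        String what = m.find() ? m.group() : "";
        return new Context(what, d.author, d.postedAtMillis, "unknown");
    };
}
```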
The four dimensions
• What (URIs or keywords)
  • The event reported and its entity participants
• Who (URIs)
  • Author’s profiles, e.g. their geographic location, age, interests, groups and social connections
  • Agents (e.g. people and organisations mentioned in documents)
• When
  • The time of post of a document
  • The time of an event mentioned in a post (past, present, future)
• Where (URIs)
  • The location of the post and/or the location of the event
What
Wednesday, 19 September 12
© Fabio Ciravegna, University of Sheffield
What?
• Identifying, classifying, clustering, ...
  • Events (and their sub-events)
  • Involved entities (including their URIs)
• Solutions in literature
  • Natural Language Processing
    • A daunting task
      • jux lyk u do to myn...RT @qua_benah: Damn!! He is gonna flood myTL with nonfa
  • Shallow(er) statistical methods
    • A collection of terms:
      • Relevant Words (and their synonyms)
      • Relevant Tags (e.g. hashtags in Twitter)
      • Relevant References (e.g. proper names)
      • Relevant relations (generally shallow)
Term Relevance
• Relevance in terms of
  • Domain relevance (Zhang 2011)
    • Match in pre-defined dictionary (e.g. geographic resource)
    • Semantic relatedness with terms in the domain/dictionary
  • Burstiness (Pohl 2012):
    • Frequency and relevance of term grow contextually in different streams
    • Study of distribution of terms over time and streams
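To make the idea of burstiness concrete, here is a minimal sketch that compares a term's count in the current time window with its average count in earlier windows. The ratio with add-one smoothing, the class name BurstScore and the threshold parameter are assumptions chosen for the example; this is not the method of Pohl 2012.

```java
class BurstScore {
    /**
     * Ratio between the current window count and the mean of past window counts.
     * Add-one smoothing avoids division by zero for terms never seen before.
     */
    static double score(int[] pastCounts, int currentCount) {
        double sum = 0.0;
        for (int count : pastCounts) {
            sum += count;
        }
        double mean = pastCounts.length == 0 ? 0.0 : sum / pastCounts.length;
        return (currentCount + 1.0) / (mean + 1.0);
    }

    /** A term is flagged as bursty when its score reaches the given threshold. */
    static boolean isBursty(int[] pastCounts, int currentCount, double threshold) {
        return score(pastCounts, currentCount) >= threshold;
    }
}
```

For example, a term seen twice per hour for three hours and then eleven times in the current hour scores (11 + 1) / (2 + 1) = 4.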
Assigning a URI to NE Terms
• POS Accuracy
  • Ritter et al. 2011: 0.88
• Parser Accuracy
  • Ritter: 0.875
• Named Entity Recognition F(1)
  • Stanford NER: 0.44
  • Ritter: 0.67
• Named Entity Classification F(1)
  • Stanford NER: 0.29
  • Ritter: 0.59
• On standard texts: F(1) > 90%
• Examples
  • Charles, Prince of Wales: en.wikipedia.org/wiki/Charles,_Prince_of_Wales
  • Sarah Ferguson? http://en.wikipedia.org/wiki/Sarah,_Duchess_of_York
  • black eye peas?
Event detection
• Major events (e.g. Flood) have sub-events
  • River burst at location XY
• Identification of Sub-Events
  • User/Domain dictionary of potentially interesting events
  • Clustering of messages around bursty terms (Pohl 2012)
• Reliability of events
  • Confirmation by reliable independent sources
    • 200 people looting D&S Alahan's on 15 Coronation Street!
  • Survive the test of time
  • Methods of detection
    • See spammers in the who section
© Fabio Ciravegna, University of Sheffield
Just a matter of feelings
87
Wednesday, 19 September 12
What and Post Relevance
Tao et al. 2012
• Many messages may refer to a specific topic/event
  • Which are the really relevant ones?
• Ranking features
  • Keywords
    • Overlapping with domain description
  • Semantic features
    • Semantic concepts overlapping with/related to domain concepts
    • Diversity of semantic concepts mentioned in post
    • Sentiment (positive/negative)
  • Syntactic
    • The post contains a URL or hashtags, is a reply, is long
  • Context
    • Social context (# followers/followees, # posts)
    • Temporal context (how far away from now)
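The syntactic ranking features listed above are the easiest ones to compute. The sketch below extracts them from the raw text of a post; the class name, the regular expressions and the "starts with @" test for replies are simplifying assumptions for illustration, not the feature definitions used by Tao et al. 2012.

```java
import java.util.regex.Pattern;

class SyntacticFeatures {
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    final boolean hasUrl;
    final boolean hasHashtag;
    final boolean isReply;   // assumption: a reply starts with an @mention
    final int length;        // length of the post in characters

    SyntacticFeatures(String text) {
        this.hasUrl = URL.matcher(text).find();
        this.hasHashtag = HASHTAG.matcher(text).find();
        this.isReply = text.startsWith("@");
        this.length = text.length();
    }
}
```

These values would then be combined with the keyword, semantic and context features in whatever ranking model is chosen.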
Who
Wednesday, 19 September 12
© Fabio Ciravegna, University of Sheffield
•
Characterising Who’s
Tables from Cano 2012
User features
•
Content Features
i) message readability;
ii) topical complexity;
iii) entity complexity;
iv) structure complexity features
- use of links to classify topics/
users (Abel 2011)
Wednesday, 19 September 12
© Fabio Ciravegna, University of Sheffield
Message
Features
90
Meformers and other animals
Table from Cano 2012
Adapted from Naaman 2010
Wednesday, 19 September 12
© Fabio Ciravegna, University of Sheffield
Classifying users allows:
- understanding better
the message content
- Identifying changes in
behaviour
- May signal important
events
91
Modelling the Social Network
• Semantically Enriched Communication Network (SECN)
  • A formal representation of individuals and their communication exchanges
  • User profiles: a set of topics, weighted according to relevance to the user
  • Similarities between users based on their profiles
• A SECN is a typed, weighted graph:
  • Typed: nodes and edges within the graph are of several different types
  • Weighted: types of edges can be assigned a weight to boost importance of one type of connection or another
Lanfranchi - to appear
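A typed, weighted graph of this kind can be sketched as follows. The edge types (MENTION, REPLY, RETWEET), the class name and the "sum of type weights" strength measure are assumptions for the example; node types and user profiles are left out for brevity, and this is not the SECN implementation.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class TypedWeightedGraph {
    /** Hypothetical edge types for a Twitter-like stream. */
    enum EdgeType { MENTION, REPLY, RETWEET }

    static final class Edge {
        final String from;
        final String to;
        final EdgeType type;

        Edge(String from, String to, EdgeType type) {
            this.from = from;
            this.to = to;
            this.type = type;
        }
    }

    private final List<Edge> edges = new ArrayList<>();
    private final Map<EdgeType, Double> typeWeights = new EnumMap<>(EdgeType.class);

    /** Boosts (or dampens) one type of connection. */
    void setTypeWeight(EdgeType type, double weight) {
        typeWeights.put(type, weight);
    }

    void addEdge(String from, String to, EdgeType type) {
        edges.add(new Edge(from, to, type));
    }

    /** Sum of the type weights of all edges from one user to another (default weight 1.0). */
    double strength(String from, String to) {
        double total = 0.0;
        for (Edge e : edges) {
            if (e.from.equals(from) && e.to.equals(to)) {
                total += typeWeights.getOrDefault(e.type, 1.0);
            }
        }
        return total;
    }
}
```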
Spammers and Trolls
• Spammers:
  • Send unsolicited bulk messages indiscriminately
• Trolls:
  • Someone who posts inflammatory, extraneous, or off-topic messages
  • Including someone spreading false rumours during emergencies
    • e.g. to create panic
    • To influence the international community (e.g. Bahrain protests)
Profiling Spammers
• Duplicate spammers and @spammers:
  • Post nearly identical tweets with or without links, sometimes addressing someone specifically (@user)
• Malicious promoters
  • Post legitimate tweets interspersed with adverts
• Friend infiltrators
  • Befriend users to receive reciprocal friendship and then spam them
Kyumin Lee et al 2011
Clever spammers
• 23,869 spammers studied
  • 18,307 not detected by Twitter after 3 months
• About the 18,307 “clever” spammers
  • Posting only 4 tweets a day
  • Linking to disreputable pages
    • Topic not mentioned in tweets
  • Abnormal number of followers/followees (>10,000)
    • Non-spammers max of 1,000 sf
  • High fluctuation in followers and followees (in terms of days)
    • Quite a stable set
Kyumin Lee et al 2011
© Fabio Ciravegna, University of Sheffield
Identifying spammers
Kyumin Lee et al 2011
Wednesday, 19 September 12
96
© Fabio Ciravegna, University of Sheffield
Where on earth...
97
Wednesday, 19 September 12
Geo-Location
• Needed to:
  • Ascertain whether a message is relevant (several floods are happening at the same time)
  • Identify the location of the issue/information reported
• Can be
  • In metadata
    • GPS information via e.g. mobile phone
  • In the document itself
    • Very detailed (“fire at 55 Surrey Street, Brighton”)
    • Rather generic (“Bridge opposite the Reed Deer”)
    • Incomplete (55 Surrey Street)
• 18-40% of messages related to emergencies contain location references
• 78-86% of users have at least one message containing geo-information
  • Mainly depending on the type of hazard and phase (preparation vs impact phase) (Vieweg 2010)
WHERE
• Location of a message vs location of reference
  • The case of the Olympics: what messages come from the ground?
• GPS active device!
  • 320m random tweets from Cumbria (UK) from start of 2011
    • ~0.9% are geocoded
  • 1.5m tweets from London in 2012
    • 3.7% are geocoded
  • Flickr has large quantities of GPS tagged pictures
• Hence extracting locations from posts is a requirement
Geo-Location Approach
• Lists of points of interest taken from Linked Open Data
  • e.g. data.gov.uk, OpenStreetMap, etc.
• String metrics to identify candidate terms
  • Social streams tend to be highly ambiguous
    • There is a London Road in every single town in England (and also a Wall Street)
• Disambiguation based on context
  • User, document, point of interest
    • User locations, terms, terms related to the point of interest, related users, etc.
  • Approach:
    • Shannon’s information entropy (Ireson 2010)
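As a reminder of what Shannon's entropy measures in this setting, the sketch below computes H = -Σ p_i log2 p_i over how often a place name (say "London Road") co-occurs with each candidate town. The usage scenario and the class name are assumptions for illustration; how Ireson 2010 actually applies entropy to disambiguation is not reproduced here.

```java
class LocationEntropy {
    /**
     * Shannon entropy, in bits, of a distribution given as raw counts.
     * 0 means all mentions point at a single candidate; higher values mean
     * the mentions are spread over many candidates (more ambiguous).
     */
    static double entropyBits(int[] counts) {
        long total = 0;
        for (int count : counts) {
            total += count;
        }
        if (total == 0) {
            return 0.0;
        }
        double entropy = 0.0;
        for (int count : counts) {
            if (count > 0) {
                double p = (double) count / total;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        return entropy;
    }
}
```

A name mentioned equally often for four different towns gives 2 bits; a name that always appears with the same town gives 0 bits and is safe to resolve.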
When?
When
• Time of posting
  • In metadata
  • May or may not be the time of the event (e.g. RT)
• Time in the message body and implications
  • Relevant to event but must be balanced with the linguistic form
    • River Don burst into Meadowhall at half past eight
    • River Don flooded. They say tomorrow it will stop
      • Tomorrow is not the time of the event
    • I wish it stopped raining and the flood stopped as well
    • The flood has not stopped
    • Flood flood flood
      • Is it now or...
Post Time and Event time
• Timestamp is not necessarily the correct one
  • There is always a lag between the event and the appearance on social streams
    • Due to the need to react
  • Retweets happen several hours after events
  • Different events have different lag time
[Figure: Event vs Reported, axis from 0 to 16]
Situation Awareness
• Typically interested in the present
  • It happens
  • It has just happened
  • It is about to happen
• Focus on a timeline
  • Expectations/distant past/distant future are of limited use to put on a timeline
• Focus on messages with immediate temporal frame expressions
  • An easier task than positioning everything on a map
  • Heavily influenced by the time of posting
• Approach: pattern matching/statistical classification
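The pattern-matching side of this approach can be illustrated with two regular expressions: one for cues of an immediate time frame and one for cues that push the statement away from the present (distant time, wishes). Both cue lists and the class name are hand-picked assumptions for the example; a realistic system would need far broader coverage or a trained classifier.

```java
import java.util.regex.Pattern;

class ImmediateTimeFrame {
    // Cues suggesting "it happens / has just happened / is about to happen".
    private static final Pattern IMMEDIATE = Pattern.compile(
            "\\b(right now|just|now|currently|at the moment|about to)\\b",
            Pattern.CASE_INSENSITIVE);

    // Cues suggesting distant time frames or wishes rather than facts.
    private static final Pattern NOT_IMMEDIATE = Pattern.compile(
            "\\b(tomorrow|yesterday|last (week|month|year)|next (week|month|year)|i wish|i hope)\\b",
            Pattern.CASE_INSENSITIVE);

    /** True when the text has an immediate cue and no distancing cue. */
    static boolean isImmediate(String text) {
        return IMMEDIATE.matcher(text).find() && !NOT_IMMEDIATE.matcher(text).find();
    }
}
```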
© Fabio Ciravegna, University of Sheffield
Tracking Real Time Intelligence in Data Streams
Wednesday, 19 September 12
TRIDS
Tracking Real Time Intelligence in Data Streams
• A modular architecture enabling SA via Social Streams
  • During preparation, emergency and aftermath
  • Event identification
  • Multi-dimensional presentation (spatial, temporal)
• Back-end
  • Gathering, storing and processing information
  • Cloud Computing on a distributed multi-core archive
  • Data analytics based on large scale NLP technologies
• Front end
  • Visual analytics based on multi-dimensional visualisation
© Fabio Ciravegna, University of Sheffield
TRIDS architecture
107
Wednesday, 19 September 12
Processing Messages
• Integration of several streams into one funnelling point
  • Uniform representation of document regardless of the stream it comes from
• Metadata extraction
  • Users, mentions, hashtags, keywords, GPS-coordinates
• Message Augmentation
  • User profiling (network, location, timeline, entropy, me/informer, etc.)
  • Content-based geo location
    • Names: gazetteer based IE based on large scale geo-LoD (Ireson 2010)
  • Trends and profiling
    • Points of view profiling (Cano Basave 2012)
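The metadata extraction step can be approximated with plain regular expressions when the stream does not already deliver mentions and hashtags as structured fields. The sketch below is an assumption about how such a step might look (class and method names are invented here); it is not the TRIDS implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageMetadata {
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    /** Collects the first capture group of every match, in order of appearance. */
    private static List<String> extract(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    /** User names mentioned in the text, without the leading @. */
    static List<String> mentions(String text) {
        return extract(MENTION, text);
    }

    /** Hashtags used in the text, without the leading #. */
    static List<String> hashtags(String text) {
        return extract(HASHTAG, text);
    }
}
```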
© Fabio Ciravegna, University of Sheffield
The user in the loop
Wednesday, 19 September 12
User Role
• User can (must) be in the loop
  • Users can easily disambiguate and add context
  • Legal responsibility of emergency services
• User must be enabled to ascertain facts and events
  • With the maximum of information possible
  • Focussing on events without losing the big picture or other important events
Extending the Formalism
• A tuple: <D, C, F, V, T>
  • D is a stream of documents d
  • C is a feature vector representing the set of contexts of documents in four dimensions: c = <dWt, dWo, dWn, dWe>
  • F: function associating context to a document
  • V is a set of visualisation strategies characterised by dimension-based visualisation widgets vWt, vWo, vWn, vWe
  • T is the definition of the user information needs
    • a task t ∈ T is a tuple <v, tWt, tWo, tWn, tWe> defining the visualisation strategy and the constraints on the four dimensions
    • e.g. specifying a set of terms/tags/concepts (tWt), document authors, groups or types (tWo), timeframe (tWn) and geo-locations (tWe)
Lanfranchi 2012
Context-based Visual Analytics
• Where-based visualisation
  • Different levels of detail with a clustering algorithm for aggregating data and displaying frequency counts
  • Multiple visual parameters to map different attributes
    • Different colours are associated with the density of information
© Fabio Ciravegna, University of Sheffield
Who
Wednesday, 19 September 12
Focused trends detection
• What-based Visualisation
  • Tag-cloud based visualisation to display trending topics based on specific user interests
    • Several dimensions to calculate trends and display them in multiple widgets
ConceptCloud
Temporal Trending
• When-based visualisation
  • A time series visualisation combining column and area plotting to provide the frequency distribution of selected social media messages
  • The timeline is dynamic, customisable via facets
Searching in the aftermath
• For
  • Debriefing and lessons-learnt sessions
  • Documents explaining actions, decisions and storing eventual evidence
  • Summary views and details of specific actions
• Hybrid search interface for mixed keyword and semantic search
• Multiple spatio-temporal views over the available knowledge
Forward Defensive
• Tasked to “listen for events” over 72 hours
  • 21st, 22nd and 23rd of February 2012
• Operators
  • Three operators analysing the data streams in real time, focussing their analysis on specific events and their evolution; these analysts were working at different physical locations
    • 3 screens each
  • An operator working on the collected data to do post-event analysis
© Fabio Ciravegna, University of Sheffield
Geo Tagging
127
Wednesday, 19 September 12
Conservative Party Conference
• Manchester, UK, October 2011
• Event spotting and riot prevention
• User: Emergency service control room
  • Gold Command
Emergency Response – Citizen Awareness
Commercially Sensitive – DO NOT DISTRIBUTE - Copyright © 2011 Knowledge Now Limited
Understanding - Tracking protester movements
Critical information – police targeted
• Directly targeted individuals
• Protestor networking police positions and movements
• Encouraged direct action
Understanding - Rumours spreading
Discover & understanding - Unexpected activities
Conclusions
• It is a messy messy world
  • We are asked to address all the most difficult natural language problems at the same time
    • Linguistic analysis
    • Contextual analysis
    • Common sense reasoning
• Why not use Wikipedia, then?
Conclusions (2)
• We can do quite well in Situation Awareness even now
  • The user in the loop addresses several issues complex for the technology
• Huge issues in terms of privacy and legality
  • X: Can you follow people?
  • Y: Yes but it is against the law
  • X: This is our problem, not yours. We abide by all the legislation of the countries where we keep our data
  • Y: Where do you keep your data about the UK?
  • X: Saudi Arabia
f.ciravegna@shef.ac.uk
http://www.dcs.shef.ac.uk/~fabio/
Wednesday, 19 September 12
© Fabio Ciravegna, University of Sheffield
Thank You
Bibliography
• Elizabeth Cano Basave. Capturing and Modelling Entity Perception in Social Streams. PhD thesis, University of Sheffield, October 2012.
• Mor Naaman, Jeff Boase and Chih-Hui Lai. Is it Really About Me? Message Content in Social Awareness Streams. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW 2010), February 2010, Savannah, Georgia.
Your Exercise
• Create a program that enables situation awareness
  • Identify a situation awareness scenario (e.g. fire) and its associated keywords (e.g. fire brigade)
  • Create a Java program that:
    • Queries Twitter and retrieves relevant posts
    • Saves tweets in a database (e.g. MySQL or, see later, Google Fusion Tables)
    • Extracts information on names by sending tweets to Zemanta.org
      • E.g. identify references to where and who
    • Queries DBpedia.org to get further references “around the entities mentioned”
      • Note: you should have a clear plan on what “around” means
      • You will have to design a very small ontology based on the types returned by Zemanta
    • Displays tweets and related concepts in a relevant way, for example using maps for where and related concepts
Example: crashes
• Identify the scenario (e.g. road crashes)
• Query Twitter for relevant messages
• Devise methods to classify relevant messages, for example use regular expressions such as
  • <xxx> crash(ed) … on ...
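One possible reading of the pattern "<xxx> crash(ed) … on ..." as a Java regular expression is sketched below. The exact expression, the class name and the choice to return everything after "on" as the location are assumptions made for illustration; you are expected to design and justify your own patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrashPattern {
    // Group 1: the word before "crash(ed)"; group 2: whatever follows the first "on".
    private static final Pattern CRASH = Pattern.compile(
            "(\\w+)\\s+crash(?:ed|es)?\\b.*?\\bon\\s+(.+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the text after "on" if the message matches the pattern, otherwise null. */
    static String location(String message) {
        Matcher matcher = CRASH.matcher(message);
        return matcher.find() ? matcher.group(2).trim() : null;
    }

    /** True when the message looks like a crash report according to the pattern. */
    static boolean isCrashReport(String message) {
        return location(message) != null;
    }
}
```

The returned string is still raw text; it would be passed to the geo-location step to turn it into a place.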
Zemanta: Output Example
Link:
  anchor=Regent Street
  confidence=0.862714
  entity_type=
Target:
  url=http://maps.google.com/maps?ll=51.5108,-0.1387&spn=0.01,0.01&q=51.5108,-0.1387 (Regent%20Street)&t=h
  title=Regent Street
  type=geolocation
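For a target of type geolocation, the coordinates are embedded in the ll= parameter of the map URL shown above. A small helper such as the following could recover them; the class name and the regular expression are assumptions based only on this one output example, not on the Zemanta documentation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeoTargetParser {
    // Matches "ll=<latitude>,<longitude>" as a query parameter.
    private static final Pattern LAT_LON = Pattern.compile(
            "[?&]ll=(-?\\d+(?:\\.\\d+)?),(-?\\d+(?:\\.\\d+)?)");

    /** Returns {latitude, longitude}, or null when the URL has no ll= parameter. */
    static double[] latLon(String url) {
        Matcher matcher = LAT_LON.matcher(url);
        if (!matcher.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(matcher.group(1)),
            Double.parseDouble(matcher.group(2))
        };
    }
}
```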
Dbpedia.org
• Query DBpedia.org to get relevant points of interest surrounding Regent Street
  • Choose the type of points of interest (e.g. Monuments)
• Display on a map
  • You can use Google Fusion Tables
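One simple plan for what "surrounding" means is a bounding box around the coordinates returned for Regent Street. The sketch below only builds the SPARQL query text; it assumes that the resources of interest carry geo:lat and geo:long values from the W3C Basic Geo vocabulary and a DBpedia ontology class such as dbo:Monument, and the class and parameter names are invented for the example. Sending the query to an endpoint is left out.

```java
import java.util.Locale;

class DbpediaNearbyQuery {
    /**
     * Builds a SPARQL query for resources of the given DBpedia ontology class
     * inside a square of +/- delta degrees around (lat, lon).
     */
    static String build(String dboClass, double lat, double lon, double delta) {
        return String.format(Locale.ROOT,
                "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n"
                + "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
                + "SELECT ?poi ?lat ?long WHERE {\n"
                + "  ?poi a dbo:%s ; geo:lat ?lat ; geo:long ?long .\n"
                + "  FILTER (?lat > %.4f && ?lat < %.4f && ?long > %.4f && ?long < %.4f)\n"
                + "} LIMIT 50",
                dboClass, lat - delta, lat + delta, lon - delta, lon + delta);
    }
}
```

Calling build("Monument", 51.5108, -0.1387, 0.01) yields a query for monuments within roughly a kilometre of the Regent Street coordinates from the Zemanta example.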
The whole point is...
• The point is not to implement everything today
• The point is to draw a plan and show the details on how you would do it
  • So to show you understood and know how to use the tools
• Required:
  • Architecture, Twitter query, DBpedia query
  • Some implementation of the workflow
• (At least partial) implementation (today or in the future) is required
© Fabio Ciravegna, University of Sheffield
Some techie details
144
Wednesday, 19 September 12
© Fabio Ciravegna, University of Sheffield
Twitter API
Wednesday, 19 September 12
Twitter API
http://net.tutsplus.com/tutorials/other/diving-into-the-twitter-api/
• There are three separate Twitter APIs actually.
  • The normal REST based API
    • Its methods constitute the core of the Twitter API, and are written by Twitter itself. It allows other developers to access and manipulate all of Twitter’s main data.
    • You’d use this API to do all the usual stuff you’d want to do with Twitter including retrieving statuses, updating statuses, showing a user’s timeline, sending direct messages and so on.
  • The Search API
    • Lets you look beyond you and your followers. You need this API if you are looking to view trending topics and so on.
  • The Stream API
    • Lets developers sample huge amounts of real time data. Since this API is only available to approved users, we aren’t going to go over this today.
• Public data can be freely accessed without an API key. When requesting private data and/or user specific data, Twitter requires authentication
http://dev.twitter.com/pages/every_developer
The API (ctd)
• There are limits to how many calls and changes you can make in a day (http://dev.twitter.com/pages/rate-limiting)
  • API usage is rate limited with additional fair use limits to protect Twitter from abuse.
• The API is entirely HTTP-based
  • Methods to retrieve data from the Twitter API require a GET request. Methods that submit, change, or destroy data require a POST.
  • API Methods that require a particular HTTP method will return an error if you do not make your request with the correct one.
  • HTTP Response Codes are meaningful
• The API presently supports the following data formats: XML, JSON, and the RSS and Atom syndication formats, with some methods only accepting a subset of these formats.
REST API Methods
http://apiwiki.twitter.com/w/page/22554679/Twitter-API-Documentation
• Timeline Methods
  • statuses/public_timeline
  • statuses/home_timeline
  • statuses/friends_timeline
  • statuses/user_timeline
  • statuses/mentions
  • statuses/retweeted_by_me
  • statuses/retweeted_to_me
  • statuses/retweets_of_me
• And several others!
Interacting with Twitter in Java
http://twitter4j.org
• Twitter4J is an unofficial Java library for the Twitter API.
  • You can easily integrate your Java application with the Twitter service
  • Twitter4J features:
    • 100% Pure Java - works on any Java Platform version 1.4.2 or later
    • Android platform and Google App Engine ready
    • Zero dependency: no additional jars required
    • Built-in OAuth support
    • Out-of-the-box gzip support
• Just download and add its jar file to the application classpath.
An Example
• Get tweets from people in Sheffield about Sheffield
  • People in Sheffield == geolocated in Sheffield
  • About Sheffield == using #Sheffield
• A number of examples at https://github.com/yusuke/twitter4j/tree/master/twitter4j-examples/src/main/java/twitter4j/examples
package tryno.server;

import twitter4j.*;
import java.util.List;

// Written against the Twitter4J search API used on these slides (Tweet, Query.KILOMETERS).
public class GetTweetByLocation {

    public String getSimpleTimeLine() {
        Twitter twitter = new TwitterFactory().getInstance();
        StringBuilder result = new StringBuilder();
        try {
            // Create a query for #sheffield restricted to 2 km around the city centre
            Query query = new Query("#sheffield");
            query.setGeoCode(new GeoLocation(53.383, -1.483), 2, Query.KILOMETERS);
            // Fire the query
            QueryResult queryResult = twitter.search(query);
            // Cycle over the tweets
            List<Tweet> tweets = queryResult.getTweets();
            for (Tweet tweet : tweets) {
                // Get the author's profile
                User user = twitter.showUser(tweet.getFromUser());
                Status status = user.isGeoEnabled() ? user.getStatus() : null;
                // Prefer the coordinates of the user's latest status,
                // otherwise fall back to the free-text profile location
                String location = (status != null && status.getGeoLocation() != null)
                        ? status.getGeoLocation().getLatitude() + ", "
                                + status.getGeoLocation().getLongitude()
                        : user.getLocation();
                // One tweet per line: @user (location) - text
                result.append("@").append(tweet.getFromUser())
                        .append(" (").append(location).append(") - ")
                        .append(tweet.getText()).append("\n");
            }
        } catch (TwitterException te) {
            te.printStackTrace();
            System.out.println("Failed to search tweets: " + te.getMessage());
            System.exit(-1);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        GetTweetByLocation tt = new GetTweetByLocation();
        System.out.println(tt.getSimpleTimeLine());
    }
}
Wednesday, 19 September 12
Output
@eatSheffield (Sheffield) - RT @barandgrillshef: #Sheffield if you had to order a cocktail what would it be, or would you just like a cup from @YorkshireTea ?
@barandgrillshef (Leopold Square, Sheffield) - #Sheffield if you had to order a cocktail what would it be, or would you just like a cup from @YorkshireTea ?
@Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield
@Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield
@barandgrillshef (Leopold Square, Sheffield) - Fancy relaxing on the beach #sheffield http://www.youtube.com/watch?v=Dax5Sbt20sA we'll see you there
@barandgrillshef (Leopold Square, Sheffield) - #Sheffield #Cloudy according to the BBC http://news.bbc.co.uk/weather/forecast/353 hows your day?
@barandgrillshef (Leopold Square, Sheffield) - #mothersday april 3 any plans #sheffield ? why not book a table now http://www.barandgrillsheffield.co.uk/mothers-day/
@Kineets (sheffield) - @shefgossip what's all the factor lot doing here @katiewaissel24 checked in #sheffield an hour ago?
@aryayuyutsu (53.382419,-1.478586) - RT @SheffieldStar 400 workers lose job as firm closes down in #Chesterfield http://bit.ly/hpX8NK (#Sheffield)
@CFMDsFMKX (Sheffield Hallam University) - We're teaching today at #sheffieldhallam #sheffield on our UG programme in #facilitiesmanagement on Managing Premises & The Work Environment
@Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield
@Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield
@Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield
@aryayuyutsu (53.382419,-1.478586) - Off for the final night of a most ROTFL-ing and LOL-ing and LMAO-ing #ComedyFestival 2011. I voted for the amazing #Thünderbards! #Sheffield
@Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield
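Each output line has the shape "@user (location) - text", where the location is either a free-text place name or a latitude,longitude pair. A minimal sketch of how such a line could be split back into its fields, for example before loading it into a table, is given below. The class TweetLineParser, its nested ParsedTweet type and the two regular expressions are illustrative assumptions and are not part of the lecture code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the lecture code): splits one output line
// of the form "@user (location) - text" into its fields.
class TweetLineParser {

    // "@user", the location in parentheses, then " - " and the tweet text.
    // The lazy (.*?) stops at the first ") - ", so this assumes the location
    // itself never contains that sequence.
    private static final Pattern LINE =
            Pattern.compile("^@(\\S+) \\((.*?)\\) - (.*)$", Pattern.DOTALL);

    // A location made of two decimal numbers separated by a comma.
    private static final Pattern COORDS = Pattern.compile(
            "^\\s*(-?\\d+(?:\\.\\d+)?)\\s*,\\s*(-?\\d+(?:\\.\\d+)?)\\s*$");

    static final class ParsedTweet {
        final String user;
        final String location;  // raw text found between the parentheses
        final Double latitude;  // null when the location is a place name
        final Double longitude; // null when the location is a place name
        final String text;

        ParsedTweet(String user, String location,
                    Double latitude, Double longitude, String text) {
            this.user = user;
            this.location = location;
            this.latitude = latitude;
            this.longitude = longitude;
            this.text = text;
        }
    }

    static ParsedTweet parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        String location = m.group(2);
        Double lat = null;
        Double lon = null;
        Matcher c = COORDS.matcher(location);
        if (c.matches()) {
            // Assumption: the first number is the latitude, as in the code above.
            lat = Double.valueOf(c.group(1));
            lon = Double.valueOf(c.group(2));
        }
        return new ParsedTweet(m.group(1), location, lat, lon, m.group(3));
    }

    public static void main(String[] args) {
        ParsedTweet t = parse("@aryayuyutsu (53.382419,-1.478586) - Off for the final night");
        System.out.println(t.user + " @ " + t.latitude + "," + t.longitude + ": " + t.text);
    }
}
```

Lines whose location is a place name (e.g. "Leopold Square, Sheffield") keep latitude and longitude as null, so a caller can decide whether to geocode them or skip them.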
Google Fusion Tables
http://www.google.com/fusiontables/public/tour/index.html
• Visualize and publish your data as maps, timelines and charts
• Host your data tables online
  • On your Google account
• Combine data from multiple people
  • By joining two tables
Google Fusion API
http://code.google.com/apis/fusiontables/
• It enables programmatic access to Google Fusion Tables content.
• It is an extension of Google's existing structured data capabilities for developers. Here are some of the things you can do:
  • Upload: populating a table in Google Fusion Tables with data, from a single row to hundreds at a time.
    • The data can come from a variety of sources, such as a local database, .CSV file, data collection form, or mobile device.
  • Query and download: it is built on top of a subset of the SQL querying language. By referencing data values in SQL-like query expressions, you can find the data you need, then download it for use by your application. Your app can do any desired processing on the data, such as computing aggregates or feeding into a visualization gadget.
  • Sync: As you add or change data in the tables in your offline repository, you can ensure the most up-to-date version is available to the world by synchronizing those changes up to Google Fusion Tables.
Tutorials at http://www.google.com/support/fusiontables/bin/answer.py?answer=184641
Creating a table
http://code.google.com/apis/fusiontables/docs/developers_guide.html#WritingApps
Adding rows
Adding rows (ctd)
Geographic locations
Expressing GeoLocations
The Code: FusionTablePreparation.java
You must have a Google account
"sql=CREATE TABLE TweetTimeLine (user: STRING, tweet: STRING, whereLoc: LOCATION)"
"sql=INSERT INTO <tableId> (user, tweet, whereLoc)
VALUES ('<user>','<the tweet text>',
'<Point><coordinates><lat>,<long></coordinates></Point>')"
Creating a table/insert row
They are two forms of SQL INSERT
runInsert: it executes all the insertion operations
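The CREATE TABLE and INSERT templates on the FusionTablePreparation slide could be assembled along the following lines. This is only a sketch: the class FusionSqlBuilder and all of its method names are hypothetical, the backslash escaping of quotes is an assumption to check against the API documentation, and no network call is made (only the "sql=..." request body is built).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not the lecture's FusionTablePreparation): builds the
// SQL strings shown on the slide and the "sql=..." request body.
class FusionSqlBuilder {

    // CREATE TABLE statement with the three columns used in the slides.
    static String createTable(String tableName) {
        return "CREATE TABLE " + tableName
                + " (user: STRING, tweet: STRING, whereLoc: LOCATION)";
    }

    // Wraps a value in single quotes. Assumption: quotes and backslashes inside
    // the value are escaped with a backslash; check this against the API docs.
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // Point in the order used by the slide's template (<lat>,<long>).
    // KML itself lists longitude first, so verify the order before relying on it.
    static String point(double lat, double lon) {
        return "<Point><coordinates>" + lat + "," + lon + "</coordinates></Point>";
    }

    // INSERT statement for one tweet, following the slide's column order.
    static String insertRow(String tableId, String user, String tweet,
                            double lat, double lon) {
        return "INSERT INTO " + tableId + " (user, tweet, whereLoc) VALUES ("
                + quote(user) + ", " + quote(tweet) + ", "
                + quote(point(lat, lon)) + ")";
    }

    // URL-encodes the statement and prefixes it with "sql=".
    static String asRequestBody(String sql) {
        try {
            return "sql=" + URLEncoder.encode(sql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(asRequestBody(createTable("TweetTimeLine")));
        // "12345" stands in for the table id returned when the table is created.
        System.out.println(asRequestBody(
                insertRow("12345", "eatSheffield", "It's #Sheffield", 53.382419, -1.478586)));
    }
}
```

Quoting every value through one function matters here because tweet text routinely contains apostrophes, which would otherwise end the SQL string early.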
GetTweetByLocation.java
Example of accessing Twitter using twitter4j.
Slightly modified version of the one presented during the lecture
What we want to create
The Main
It creates the table with 3 columns
It inserts the tweets into the table
Insert username and password
Instructions
• Insert your username and password in the main of FusionTablePreparation
• Run the class FusionTablePreparation
  • This will access the tweets and generate a table in your Google Fusion Tables space at http://www.google.com/fusiontables/Home?pli=1 (you must log in to Google to access that)
• Go to your Fusion Tables space
  • Click on “Owned by Me”
  • Click on the table TweetTimeLine
Instructions (ctd)
• You will see the tweet table
• Select Visualize -> Map
Result
Jena
• A Java framework for building Semantic Web applications.
  • It provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine.
• Jena is open source and it includes
  • An RDF API
  • Reading and writing RDF in RDF/XML, N3 and N-Triples
  • An OWL API
  • In-memory and persistent storage
  • A SPARQL query engine
incubator.apache.org/jena/
SPARQL in Jena
http://www.ibm.com/developerworks/xml/library/j-sparql/#2
• SPARQL queries are created and executed with Jena via classes in the com.hp.hpl.jena.query package
• QueryFactory is the simplest approach. QueryFactory has various create() methods to read a textual query from a file or from a String.
  • create() methods return a Query object, which encapsulates a parsed query.
• Then create an instance of QueryExecution, a class that represents a single execution of a query.
  • Call QueryExecutionFactory.create(query, model), passing in the Query to execute and the Model to run it against.
  • Note: the query does not need a FROM clause.
• Model = Ontology (the Model in Jena)
QueryExecution
• There are several different execute methods on QueryExecution, each performing a different type of query
  • For a simple SELECT query, call execSelect(), which returns a ResultSet.
    • ResultSet allows you to iterate over each QuerySolution returned by the query, providing access to each bound variable's value.
  • Alternatively, ResultSetFormatter can be used to output query results in various formats.
import com.hp.hpl.jena.query.*;
public class jenaToDBpedia {
    public static void main(String[] args) {
        String queryString=
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>"+
            "PREFIX type: <http://dbpedia.org/class/yago/>"+
            "PREFIX prop: <http://dbpedia.org/property/>"+
            "SELECT DISTINCT ?country_name ?country "+
            "WHERE {"+
            " ?country a type:LandlockedCountries. "+
            " ?country rdfs:label ?country_name. "+
            " ?country prop:populationEstimate ?population ."+
            " FILTER (?population > 15000000 " +
            " && langMatches(lang(?country_name), 'EN')) ." +
            "}";
        jenaToEndPoint jdp= new jenaToEndPoint();
        jdp.queryEndPoint(queryString, "http://dbpedia.org/sparql");
    }
}
Again this code is simplified
Never write SPARQL queries directly in
the code!!!
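One way to follow this advice is to keep each query in its own text file and load it at run time. The sketch below uses only the Java standard library; the class name QueryLoader, the ".rq" file extension and the temporary file used for the demonstration are assumptions made for illustration. The string it returns can then be passed to QueryFactory.create(...) exactly as queryString is in the slides.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: keeps SPARQL text out of the Java source by loading it
// from a file at run time.
class QueryLoader {

    // Reads the whole query file as UTF-8 text.
    static String load(Path queryFile) throws IOException {
        return new String(Files.readAllBytes(queryFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // For the demo the query file is created on the fly; in a real project
        // it would be a file shipped alongside the code.
        Path file = Files.createTempFile("landlocked", ".rq");
        Files.write(file,
                "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1".getBytes(StandardCharsets.UTF_8));

        String queryString = load(file);
        System.out.println(queryString);
        // queryString can now be handed to QueryFactory.create(queryString).

        Files.delete(file);
    }
}
```

Keeping queries in files means they can be edited and tested against an endpoint without recompiling, and it avoids the string-concatenation mistakes (missing spaces, unbalanced quotes) that inline queries invite.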
Querying DBpedia in Jena
import com.hp.hpl.jena.query.*;
public class jenaToEndPoint {
    public void queryEndPoint(String queryString, String endPoint){
        // parse the textual query into a Query object
        Query query = QueryFactory.create(queryString);
        // create a query execution bound to the remote SPARQL service
        QueryExecution qexec = QueryExecutionFactory.sparqlService(endPoint, query);
        // execute the query and process the results
        try {
            // simple select
            ResultSet results = qexec.execSelect();
            ResultSetFormatter.out(System.out, results, query);
        }
        finally {
            // always release the resources held by the execution
            qexec.close();
        }
    }
}
Getting the results
• Interface ResultSet
  • Method Summary
• Cycle on the solutions (of type QuerySolution)
http://jena.sourceforge.net/ARQ/javadoc/com/hp/hpl/jena/query/ResultSet.html
• Iterate on all variables and return value as Resource or as Literal
http://jena.sourceforge.net/ARQ/javadoc/com/hp/hpl/jena/query/QuerySolution.html
Zemanta
Zemanta.com
• Zemanta analyzes user-generated content (e.g. a blog post) using natural language processing and semantic search technology to suggest pictures, tags and links to related articles.
• Zemanta suggests content from Wikipedia, YouTube, IMDB, Amazon.com, Crunchbase, Flickr, ITIS, Musicbrainz, Mybloglog, Myspace, NCBI, Rotten Tomatoes, Twitter, Facebook, Snooth and Wikinvest, as well as the blogs of other Zemanta users.
Calling Zemanta
Getting the results
Download