© Fabio Ciravegna, University of Sheffield Harnessing Social Streams Professor Fabio Ciravegna Department of Computer Science, University of Sheffield fabio@dcs.shef.ac.uk 1 Picture from http://thesocietypages.org/sociologylens/tag/nathan-jurgenson/ Wednesday, 19 September 12 Web 2.0 1991 1995 2000 2005 2009 2005 Jan 2001 Wikipedia 2009 2006 1999 1999 August 2002 April 2003 Apple’s iTunes: bound to become the largest store in the world for digital music (175 million songs in first quarter of 2007) © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 2004 facebo Source: BBC 2005 YouTube Web 2.0: Trends • From connecting documents to connecting people • User trends • Harnessing collective intelligence • Trusting users as co-developers • Leveraging the long tail through customer self-service © Fabio Ciravegna, University of Sheffield http://oreilly.com/web2/archive/what-is-web-20.html http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html Wednesday, 19 September 12 3 Collective Intelligence • Collective intelligence is a shared or group intelligence that emerges from the collaboration and competition of many individuals intelligence appears in a wide variety of forms of consensus decision making in bacteria, animals, humans, and computer networks • According to Don Tapscott and Anthony D. Williams, collective intelligence is mass collaboration. • In order to happen, four principles need to exist • openness, peering, sharing and acting globally. © Fabio Ciravegna, University of Sheffield • Collective Definitions from Wikipedia 4 Wednesday, 19 September 12 Harnessing collective intelligence • is the foundation of the web. As users add new content, and new sites, it is bound in to the structure of the web by other users discovering the content and linking to it. • Much as synapses form in the brain, with associations becoming stronger through repetition or intensity, the web of connections grows organically as an output of the collective activity of all web users. • Google's breakthrough in search was PageRank • A method of using the link structure of the web rather than just the characteristics of documents to provide better search results. • What is popular is important • Many hyperlinks reaching the page • Important hyperlinks reaching the page Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Hyperlinking Harnessing collective intelligence Let people tell you what they know/what they like • Social bookmarking © Fabio Ciravegna, University of Sheffield • StumbleUpOn Wednesday, 19 September 12 Harnessing Collective Intelligence: Wikipedia an online encyclopaedia based on the unlikely notion that an entry can be added by any web user, and edited by any other, • A radical experiment in trust • Eric Raymond's dictum (originally coined in the context of open source software) "with enough eyeballs, all bugs are shallow," • Wikipedia is already in the top 10 websites. This is a profound change in the dynamics of content creation! Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Wikipedia, © Fabio Ciravegna, University of Sheffield In principle: yet another encyclopaedia But you can edit the content! The collective intelligence kicks in Wednesday, 19 September 12 8 © Fabio Ciravegna, University of Sheffield Discussion 9 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield The history to track changes (trust the users but...) 10 Wednesday, 19 September 12 Harnessing: The Mechanical Turk (amazon.com) web service that Amazon is calling 'artificial artificial intelligence.' • If you need a process completed that only humans can do given current technology (judgment calls, text drafting or editing, etc.), • you can simply make a request to the service to complete the process. • The machine will then complete the task with volunteers, and return the results to your software." https://www.mturk.com/mturk/welcome Image from wikipedia Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield •A Other examples Answers © Fabio Ciravegna, University of Sheffield • Yahoo Wednesday, 19 September 12 Collective Intelligence: Openness • During the early ages of the communications technology, people and companies are reluctant to share ideas, intellectual property and encourage self-motivation The reason for this is these resources provide the edge over competitors • Now people and companies tend to loosen hold over these resources because they reap more benefits in doing so • By allowing others to share ideas and bid for franchising, their products are able to gain significant improvement and scrutiny through collaboration © Fabio Ciravegna, University of Sheffield • 13 Wednesday, 19 September 12 Collective Intelligence: Peering is a form of horizontal organisation with the capacity to create information technology and physical products • The ‘opening up’ of the Linux program where users are free to modify and develop it provided that they made it available for others. • Participants in this form of collective intelligence have different motivations for contributing, • • but the results achieved are for the improvement of a product or service “Peering succeeds because it leverages self-organisation • a style of production that works more effectively than hierarchical management for certain tasks” © Fabio Ciravegna, University of Sheffield • This 14 Wednesday, 19 September 12 Collective Intelligence: Sharing and more companies have started to share some intellectual property, while maintaining some degree of control over others, like potential and critical patent rights • This is because companies have realised that by limiting all their intellectual property, they are shutting out all possible opportunities • Sharing some has allowed them to expand their market and bring products out more quickly Google with their free services and Apple’s App store are examples of opening up part of the IP so to allow (nearly) everyone to contribute © Fabio Ciravegna, University of Sheffield • More 15 Wednesday, 19 September 12 Collective Intelligence: Acting Globally emergence of communication technology has prompted the rise of global companies, or e-Commerce • E-Commerce has allowed individuals to set up businesses at almost no or low overhead costs • As the influence of the Internet is widespread, a globally integrated company would have no geographical boundaries • Music industry is ubiquitous, you can sell from everywhere • • Selling a dishwasher is more complex They would also have global connections, allowing them to gain access to new markets, ideas and technology © Fabio Ciravegna, University of Sheffield • The 16 Wednesday, 19 September 12 5 Great Ways of Harnessing Collective Dion Hinchcliffe Intelligence http://www.kreeo.com/#cbok/Collective_Intelligence/contents • The Hub of A Hard To Recreate Data Source This is a classic Web 2.0 concept and success here often devolves to being the first entry with an above average implementation. • Wikipedia, eBay, and others which are almost entirely the sum of the content their users contribute. • Just be careful and avoid crowded niches, like peer production news. The Wikipedia content: from competitive advantage to common heritage Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Be 5 Great Ways of Harnessing Collective Dion Hinchcliffe Intelligence http://www.kreeo.com/#cbok/Collective_Intelligence/contents • Collective Intelligence Out Google uses hyperlink analysis to determine the relevance of a given page and builds its own database of content which it then shares through its search engine. • Not only does this approach completely avoid a dependency on the ongoing kindness of strangers it also lets you build a very big content base from the outset. Do you think that this ad will work? (note this is a spoof ad) Wednesday, 19 September 12 A personal appeal from Microsoft founder Bill Gates Help Bing! index pages © Fabio Ciravegna, University of Sheffield • Seek 5 ways (ctd) Large-Scale Network Effects • This is arguably harder to do than either of the methods above • Small examples can be found in things like the Million Dollar Pixel Page. • Probably not very repeatable, but when they happen, they can happen big. Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Trigger SMS, Web, Email, Radio, Phone, Twitter, Facebook, Television, List-serves, Live streams, Situation Reports Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield http://haiti.ushahidi.com/ 5 ways (ctd) a Reverse Intelligence Filter • The idea is that hyperlinks, trackbacks, and other information references can be counted and used as a reference to determine what it's important. • The blogosphere is the greatest example of this • Combined with temporal filters and other techniques and you can create situation awareness engines easily. • It sounds similar to #2 but it's different in that it can be used with or without external data sources and is aimed not at finding but at eliding the irrelevant altogether as an active filter. Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Create Trusting users as co-developers • Let people play with your data • Let people play with your software © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 Involving users as co-developers: sells the same products as competitors such as Barnesandnoble.com, • They receive the same product descriptions, cover images, and editorial content from their vendors. • What is the difference, then? • Barnesandnoble.com search is likely to lead with the company's own products, or sponsored results • Amazon has made a science of user engagement. • They have an order of magnitude more user reviews, invitations to participate in varied ways on virtually every page • They use user activity to produce better search results. • Wednesday, 19 September 12 Amazon always leads with "most popular", a real-time computation based on • Sales • "Flow" around products © Fabio Ciravegna, University of Sheffield • Amazon © Fabio Ciravegna, University of Sheffield Involving users as developers: amazon.co.uk 24 Wednesday, 19 September 12 amazon.co.uk as Web 2.0 and As old Shop Keeper Instruction to publishers to provide additional service (advantage to both: cooperation to sell more) details about the service and offers (standard e-commerce services) more details about the product (standard service) cluster interests based on past experience (like the old shop keeper) 25 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Look Inside like in a shop the judgment is shared by the users You can judge by yourself about the quality by the comment (you may disagree) the community helps judge the quality of the comment Comments on the comments Comments are published positive or negative (or so it seems!) 26 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Customer for customers Amazon.com: Personalised Services • This is my space © Fabio Ciravegna, University of Sheffield • I can attach my wishes to it • my friends can see it and send me a present • a mediator between people • like the old shop keepers who know the wishes of their customers 27 Wednesday, 19 September 12 Supporting community life • Let people upload their data and share it • Creating, supporting and free communities © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 The Social Web The social web can be described as people interlinked and interacting with engaging content in a conversational and participatory manner via the WWW • • people are brought together through a variety of shared interests. Since social web applications are built to encourage communication between people, they typically emphasise some combination of the following social attributes: • Identity: who are you? • Reputation: what do people think you stand for? • Presence: where are you? • Relationships: who are you connected with? who do you trust? • Groups: how do you organize your connections? • Conversations: what do you discuss with others? • Sharing: what content do you make available for others to interact with? Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • http://en.wikipedia.org/wiki/Social_Web Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Or in short... The social web (ctd) • Two • "people focus": the person is the focus of social interaction • A profile is constructed by each user • e.g. Bebo, Facebook, and Myspace. "hobby focus" • e.g. photography websites such as Flickr, Kodak Gallery and Photobucket • Pinterest is the new hobby-related site Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • types of Web sites: 2010 http://royal.pingdom.com/2010/02/16/study-ages-of-social-network-users/ Wednesday, 19 September 12 From Wikipedia • A "folksonomy" is a collaboratively generated, open-ended labeling system that enables Internet users to categorize content such as Web pages, online photographs, and Web links. • The freely chosen labels -- called tags -- help improve search engine's effectiveness because content is categorized using a familiar, accessible, and shared vocabulary. • The labeling process is called tagging. • Because folksonomies develop in Internet-mediated social environments, users can discover (generally) who created a given folksonomy tag, and see the other tags that this person created. • In this way, folksonomy users often discover the tag sets of another user who tends to interpret and tag content in a way that makes sense to them. • The result, often, is an immediate and rewarding gain in the user's capacity to find related content. Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Folksonomy Tag Cloud (or Folksonomy) •Part of the appeal of folksonomy is its inherent subversiveness: © Fabio Ciravegna, University of Sheffield • Faced with the dreadful performance of the search tools that Web sites typically provide, • Folksonomies can be seen as a rejection of the search engine status quo in favor of tools that are both created by the community and beneficial to the community. •Folksonomies arise in Web-based communities where special provisions are made at the site level for creating and using tags. •These communities are established to enable Web users to label and share user-generated content, such as photographs, or to collaboratively label existing content, such as Web sites, books, works in the scientific and scholarly literatures, and blog entries. 34 Wednesday, 19 September 12 Blogging and the Wisdom of Crowds •One of the most highly touted features of the Web 2.0 era is the rise of blogging. © Fabio Ciravegna, University of Sheffield •Personal home pages have been around since the early days of the web, and the personal diary and daily opinion column around much longer than that, so just what is the fuss all about? •At its most basic, a blog is just a personal home page in diary format. But as Rich Skrenta notes, the chronological organization of a blog "seems like a trivial difference, but it drives an entirely different delivery, advertising and value chain." 35 Wednesday, 19 September 12 Typing What I'm Thinking To Everyone Reading Twitter (2006) is a social networking and micro-blogging service that allows its users to send and read other users' updates (known as tweets), • text-based posts of up to 140 characters in length. • Updates are displayed on the user's profile page and delivered to other users who have signed up to receive them. http://en.wikipedia.org/wiki/Twitter Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Twitter Tweet? • Twitter now has more than 140 million active users, sending 340 million tweets every day (3900/second on average, peaks of 25,000) • Twitter users send over a billion tweets every 72 hours • The average Twitter user has 126 followers • • The average Twitter users tweets a link at 9am, achieves a 1.17% click through rate (1.4 followers click the link) • Over 40% of Twitter users do not tweet anything • About 0.05% of the total twitter population attract almost 50% of attention on the channel No one: • 71% of the millions of tweets each day attract no reaction • 25% of Twitter users have no followers • 45% of the mass communications posted on Twitter are nonsense • 8% of Americans use Twitter Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield 99 New Social Media Stats for 2012 The Long Tail © Fabio Ciravegna, University of Sheffield • Overture and Google's success came from an understanding of what Chris Anderson refers to as "the long tail," the collective power of the small sites that make up the bulk of the web's content. • Overture and Google figured out how to enable ad placement on virtually any web page. • What's more, they eschewed publisher/ad-agency friendly advertising formats such as banner ads and popups in favor of minimally intrusive, context-sensitive, consumer-friendly text advertising. • The Web 2.0 lesson: leverage customer-self service and algorithmic data management to reach out to the entire web, to the edges and not just the center, to the long tail and not just the head. 38 Wednesday, 19 September 12 RSS Feeds • One of the things that has made a difference is a technology called RSS. © Fabio Ciravegna, University of Sheffield • RSS is the most significant advance in the fundamental architecture of the web since early hackers realized that CGI could be used to create databasebacked websites. • RSS allows someone to link not just to a page, but to subscribe to it, with notification every time that page changes. • Some call it the "live web". • Dynamic websites" (i.e., database-backed sites with dynamically generated content) replaced static web pages well over ten years ago • What's dynamic about the live web are not just the pages, but the links. • A link to a weblog is expected to point to a perennially changing page, with "permalinks" for any individual entry, and notification for each change. • An RSS feed is thus a much stronger link than, say a bookmark or a link to a single page. Wednesday, 19 September 12 39 RSS Feeds (2) © Fabio Ciravegna, University of Sheffield •RSS also means that the web browser is not the only means of viewing a web page. • While some RSS aggregators, such as Bloglines, are web-based, others are desktop clients, and still others allow users of portable devices to subscribe to constantly updated content. •RSS is now being used to push not just notices of new blog entries, but also all kinds of data updates, including stock quotes, weather data, and photo availability. 40 Wednesday, 19 September 12 Harnessing Collective Intelligence in Texts , University of Sheffield Wednesday, 19 September 12 Text Analysis • Two • Information Retrieval • Keyword matching (as in search engines) • Issues: • Synonyms (beautiful vs attractive) • Polysemous (bank Vs bank Vs bank) • This makes extracting information by using keywords interesting effective and a challenge • Several open source tools e.g. Solr Natural Language Processing • Wide spectrum • From named entity recognition (e.g. People’s name) • To event extraction • To modelling user intentions Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • main historic approaches to text analysis Anatomy of Search Engines: indexing Ranking Algorithm Crawler Ranked Documents Indexer Indices 43 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield pages Indexing •Crawler © Fabio Ciravegna, University of Sheffield •Discovers pages over the web •Sends pages to Indexer •Indexer •Extract keywords and creates tables of indices •Ranker Next Slides •Decides an order of importance/relevance among pages Next Slides 44 Wednesday, 19 September 12 Indexing via Inverted Files © Fabio Ciravegna, University of Sheffield •Extract words from each document A picture by M. Lapata 45 Wednesday, 19 September 12 Extracting words © Fabio Ciravegna, University of Sheffield •Tokenization: separation of words •From a document (one big string) • “I had my porridge hot” •To a sequence of strings • “I” “had” “my“ “porridge” “hot” •Stop word filtering •Words with a very high frequency but low semantic content (meaning) are not considered • the, a, an, on, in, for, of, with, without, etc. • “had” “porridge” “hot” 46 Wednesday, 19 September 12 Processing Indexing via Inverted Files (2) •Create an Index for each word © Fabio Ciravegna, University of Sheffield • Keeping track of position of words in files 47 Wednesday, 19 September 12 Towards Semantic Analysis advanced version of Semantic Analysis (or NLP) is to look for special types of keywords, e.g. Named Entities or domain specific terms (terminology in aerospace) • Identifies small scalable structures (proper names, dates, numbers, currencies, etc.) • Most classes fixed (e.g. People, Locations) Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • An NE Recognition & Coreference Organisation 19:16 Moody's rates Province of Saskatchewan A3 Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. MNY The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for % government purposes, mainly Saskatchewan Power Corp. Date © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 & Coreference NE Recognition & Coreference Organisation 19:16 Moody's rates Province of Saskatchewan A3 Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. MNY The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for % government purposes, mainly Saskatchewan Power Corp. Date © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 NYU: NE Recognition (1994) •Gazetteer lookup • Patterns: Person -> FirstName + Word.Capitalised Person -> Person + Word.Capitalised Company -> Word.Capitalised+ <company-indicator> After Name Recognition [name type: Person Sam Schwarz] retired as executive vice president of the famous hot dog manufacturer, [name type: Company Hupplewhite Inc.] He will be succeeded by [name type: Person Harry Himmelfarb]. © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Terminology Recognition • Task of reducing a term to a URI 51 Wednesday, 19 September 12 Collaborative Filtering 52 Wednesday, 19 September 12 Collaborative Filtering • The process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc Applications of collaborative filtering typically involve very large data sets • The method of making automatic predictions about the interests of a user by collecting taste information from many users • Underlying assumption: those who agreed in the past tend to agree again in the future • Predictions are specific to the user using information from many users © Fabio Ciravegna, University of Sheffield • For example, a collaborative filtering or recommendation system for music tastes could make predictions about which music a user should like given a partial list of that user's tastes (likes or dislikes). 53 Wednesday, 19 September 12 Collaborative Filtering from the more simple approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes © Fabio Ciravegna, University of Sheffield • Differs 54 Wednesday, 19 September 12 Approaches to Collaborative Filtering: user centric • Given a user u • Look for users who share the same rating patterns with u • Use their ratings to calculate a prediction for u • Two types: • Active filtering: peer-to-peer based approach • peers, coworkers, and people with similar interests rate products, reports, and other material objects, and sharing this information over the web for other people to see © Fabio Ciravegna, University of Sheffield • User-centric 55 Wednesday, 19 September 12 User Centric (2) Passive-filtering: collects information implicitly • Web browser collects user’s preferences by following and measuring their actions • E.g. Purchasing an item or not • See lecture 4 on collective intelligence for search engines • The Bing/Google issue © Fabio Ciravegna, University of Sheffield • 56 Wednesday, 19 September 12 Approaches to Collaborative Filtering (2) • Product centric Build an item-item matrix determining relationships between pairs of items • Using the matrix, and the data on the current user, infer his taste © Fabio Ciravegna, University of Sheffield • 57 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Fas forward: today’s lab class 58 Wednesday, 19 September 12 Seeking Collective Intelligence We will seek collective intelligence to address situation awareness thought social media • Situation awareness = understanding of situations, e.g. emergencies • We will in particular address: • The use of Twitter • The modelling of the context using Linked Open Data • • You should know about The use of maps for displaying information © Fabio Ciravegna, University of Sheffield • 59 Wednesday, 19 September 12 Your Exercise Create a program that enables situation awareness • Identify a situation awareness scenario (e.g. fire) and their associated keywords (e.g. fire brigade) • Create a Java program that: • Queries Twitter and retrieves relevant posts • Saves tweets in a database (e.g. mySql or Google FusionTables) • Extracts information on names by sending tweets to Zemanta.org • • • E.g. Identify references to where and who Queries DbPedia.org to get further references “around the entities mentioned” • Note: you should have a clear plan on what “around” means • You will have to design a very small ontology based on the types returned by Zemanta Display tweets and related concepts in a relevant way, for example using maps for where and related concepts © Fabio Ciravegna, University of Sheffield • 60 Wednesday, 19 September 12 Example: crashes Identify the scenario (e.g. road crashes) • Query Twitter for relevant messages • Devise relevant queries and methods to classify relevant messages - for example use regular expressions such as • <xxx> crash(ed) … on ... © Fabio Ciravegna, University of Sheffield • 61 Wednesday, 19 September 12 Examples Link: anchor=Regent Street confidence=0.862714 entity_type= Target: url=http://maps.google.com/maps? ll=51.5108,-0.1387&spn=0.01,0.01&q=51.5108,-0.1387 (Regent%20Street)&t=h title=Regent Street © Fabio Ciravegna, University of Sheffield type=geolocation 62 Wednesday, 19 September 12 Dbpedia.org • Query DBPedia.org to get relevant points of interest surrounding Regent Street • • Choose the type of points of interest (e.g. Monuments) Display on a map You can use Google Fusion Tables © Fabio Ciravegna, University of Sheffield • 63 Wednesday, 19 September 12 • The point is not to implement everything today • The point is to draw a plan and show the details on how you would do it • • So to show you understood and know how to use the tools • Required: • Architecture, Twitter query, DBpedia query • Some implementation of the workflow (At least partial) implementation (today or in the future) is required © Fabio Ciravegna, University of Sheffield The whole point is... 64 Wednesday, 19 September 12 Prof. Dr. Fabio Ciravegna, Director of Research and Innovation in the Digital World University of Sheffield, United Kingdom and Department of Computer Science University of Sheffield, United Kingdom f.ciravegna@shef.ac.uk Common work with Vita Lanfranchi, Neil Ireson, Annalisa Gentile, Sam Chapman, Paul Ridgeway, Duncan Grant,65 Ziqi Zhang and Elizabeth Cano Basave Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Situation awareness via social streams • All knowledge that is accessible and can be integrated into a coherent picture to assess and cope with a situation (Sarter 1991) • Wednesday, 19 September 12 Complete and reliable information elusive © Fabio Ciravegna, University of Sheffield Situation Awareness Information Needs Typical questions for SA in Emergency Management: • Topics discussed • • Trend analysis Event identification • • who/what/where/when e.g. what areas of Haiti need what type of support? Are there any likely survivors trapped under the rubbles? © Fabio Ciravegna, University of Sheffield • 67 Wednesday, 19 September 12 Social Streams and SA Online forums, web sites, RSS feeds, Microblogging • Ubiquitous, rapid, cross platform medium of communication (Vieweg 2010) • Tweeters as sensors (Sakaki 2010) • They support self-organising behaviour that can produce accurate results often in advance of official communication (Vieweg 2010) • The Arab Spring uprisings • The vast majority of people got their information from social media sites (88% in Egypt and 94% in Tunisia) • Facebook usage swelled in the Arab region between January and April and sometimes more than doubled • • © Fabio Ciravegna, University of Sheffield • Number of users jumped by 30% to 27.7m, compared with 18% growth during the same period in 2010 #1 Twitter hashtag in 2011 was #egypt Wednesday, 19 September 12 68 Requirements • Monitoring in real time • Identifying relevant messages while getting rid of noise • Ascertaining reliability of source or information • Relating information to background knowledge • Understanding timeliness, geo- reference points • Monitoring specific groups/networks • Separating local information from external information (Vieweg 2010) • Getting messages in languages you understand © Fabio Ciravegna, University of Sheffield • Analysis during the aftermath • Storing, searching, bookmarking, trend analysis 69 Wednesday, 19 September 12 A SimplifiedTask Monitoring the development of a house fire 10:00 am 11:00 am 11:00 am © Fabio Ciravegna, University of Sheffield 10:49 am 70 Wednesday, 19 September 12 A SimplifiedTask Monitoring the development of a house fire 10:00 am 11:00 am 11:00 am © Fabio Ciravegna, University of Sheffield 10:49 am 70 Wednesday, 19 September 12 A SimplifiedTask Monitoring the development of a house fire 10:00 am 11:00 am 11:00 am © Fabio Ciravegna, University of Sheffield 10:49 am 70 Wednesday, 19 September 12 A SimplifiedTask Monitoring the development of a house fire 10:00 am 11:00 am 11:00 am © Fabio Ciravegna, University of Sheffield 10:49 am 70 Wednesday, 19 September 12 A SimplifiedTask Monitoring the development of a house fire 10:00 am 11:00 am 11:00 am © Fabio Ciravegna, University of Sheffield 10:49 am 70 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Collective Intelligence 71 Wednesday, 19 September 12 Social Streams • • Characteristics: • High in volume, and constantly increasing, • Often duplicated, incomplete, imprecise and incorrect • Written in informal style, extensive use of shorthand, symbols (e.g., emoticons), misspellings etc. • Generally concerning the short-term zeitgeist • Covering every conceivable domain Semantic Underspecification • Your life in 140 characters • The context is all important Spammers are an everyday reality Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • 72 The Web seen through Wikipedia © Fabio Ciravegna, University of Sheffield Barack Hussein Obama is the 44th and current President of the United States. He is the first African American to hold the office. Obama served as a U.S. Senator representing the state of Illinois from January 2005 to November 2008, when he resigned following his victory in the 2008 presidential election. 73 Wednesday, 19 September 12 The Web seen through Wikipedia © Fabio Ciravegna, University of Sheffield Barack Hussein Obama is the 44th and current President of the United States. He is the first African American to hold the office. Obama served as a U.S. Senator representing the state of Illinois from January 2005 to November 2008, when he resigned following his victory in the 2008 presidential election. 73 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield The Web through Social Streams Twitter alone produces the equivalent in text of the 1.4m pages in Wikipedia every two days 74 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield The Web through Social Streams Twitter alone produces the equivalent in text of the 1.4m pages in Wikipedia every two days 74 Wednesday, 19 September 12 All Examples are Real! • • • Bristol Carnival, • Bristol, UK, July 2011 • User: emergency service control room (Gold Command) © Fabio Ciravegna, University of Sheffield • Conservative Party Conference • Manchester, UK, October 2011 • User: Emergency service control room (Gold Command) Forward Defensive Exercise • London, February 2012 • User: undisclosed North East Floods, July 2012 • User: SSSW 2012 75 Wednesday, 19 September 12 Linguistic issues • Short sentences • Alternative Language Use of slang/swear words • Negatives • Conditional statements • Hope/prayer statements • Use of irony/sarcasm • Ambiguity • Unreliable capitalisation • Data sparsity (Saif 2012) Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • 76 Sarcasm • Look for word combinations with opposite polarity, e.g. “rain” or “delay” plus “brilliant” Going to the dentist on my weekend home. Great. I'm totally pumped • Inclusion of world knowledge / ontologies can help • e.g. knowing that people typically don't like going to the dentist, • that people typically like weekends better than weekdays It's an incredibly hard problem and an area where we expect not to get it right that often A slide by Diana Maynard Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • 77 © Fabio Ciravegna, University of Sheffield Modelling Streams and Context Wednesday, 19 September 12 Modelling Streams and Context A tuple: <D, C, F > • D is a stream of documents d • C is a feature vector representing the set of contexts of documents in four dimensions c = < dWt, dWo, dWn, dWe > • F: function associating context to a document dWt F dWn dWe dWo D Wednesday, 19 September 12 Lanfranchi 2012 79 © Fabio Ciravegna, University of Sheffield • The four dimensions What (URIs or keywords) • • • • The event reported and its entity participants Who (URIs) • Author’s profiles e.g. their geographic location, age, interests, groups and social connections • Agents (e.g. people and organisations mentioned in documents) © Fabio Ciravegna, University of Sheffield • When • The time of post of a document • The time of an event mentioned in a post (past, present, future) Where (URIs) • The location of the post and/or the location of the event 80 Wednesday, 19 September 12 What Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield What? • Identifying, classifying, clustering, ... • Events (and their sub-events) • Involved entities (including their URIs) Solutions in literature • Natural Language Processing • A daunting task • • jux lyk u do to myn...RT @qua_benah: Damn!! He is gonna flood myTL with nonfa Shallow(er) statistical methods • • Wednesday, 19 September 12 A collection of terms: • Relevant Words (and their synonyms) • Relevant Tags (e.g. hashtags in Twitter) • Relevant References (e.g. proper names) Relevant relations (generally shallow) © Fabio Ciravegna, University of Sheffield • 82 Term Relevance Relevance in terms of • Domain relevance (Zhang 2011) • • • Match in pre-defined dictionary (e.g. geographic resource) Semantic relatedness with terms in the domain/dictionary Bursti-ness (Pohl 2012): • • Frequency and relevance of term grow contextually in different streams Study of distribution of terms over time and streams © Fabio Ciravegna, University of Sheffield • 83 Wednesday, 19 September 12 84 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield 84 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield 84 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield 84 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Assigning a URI to NE Terms POS Accuracy • • Parser Accuracy • • • Ritter et al. 2011: 0.88 Ritter: 0.875 Named Entity Recognition F(1) • Stanford NER: 0.44 • Ritter: 0.67 Named Entity Classification F(1) • Stanford NER: 0.29 • Ritter: 0.59 On standard texts F(1)>90% © Fabio Ciravegna, University of Sheffield • 85 Wednesday, 19 September 12 Assigning a URI to NE Terms POS Accuracy • • Parser Accuracy • • • Ritter et al. 2011: 0.88 Ritter: 0.875 Named Entity Recognition F(1) • Stanford NER: 0.44 • Ritter: 0.67 Named Entity Classification F(1) • Stanford NER: 0.29 • Ritter: 0.59 On standard texts F(1)>90% © Fabio Ciravegna, University of Sheffield • 85 Wednesday, 19 September 12 Assigning a URI to NE Terms POS Accuracy • • Parser Accuracy • • • Ritter et al. 2011: 0.88 Ritter: 0.875 Named Entity Recognition F(1) • Stanford NER: 0.44 • Ritter: 0.67 Named Entity Classification F(1) • Stanford NER: 0.29 • Ritter: 0.59 On standard texts F(1)>90% © Fabio Ciravegna, University of Sheffield • Charles, Prince of Wales en.wikipedia.org/wiki/Charles,_Prince_of_Wales 85 Wednesday, 19 September 12 Assigning a URI to NE Terms POS Accuracy • • Parser Accuracy • • • Ritter et al. 2011: 0.88 Ritter: 0.875 Named Entity Recognition F(1) • Stanford NER: 0.44 • Ritter: 0.67 Named Entity Classification F(1) • Stanford NER: 0.29 • Ritter: 0.59 On standard texts F(1)>90% © Fabio Ciravegna, University of Sheffield • Charles, Prince of Wales en.wikipedia.org/wiki/Charles,_Prince_of_Wales 85 Wednesday, 19 September 12 Assigning a URI to NE Terms POS Accuracy • • Parser Accuracy • • • Ritter et al. 2011: 0.88 Ritter: 0.875 Named Entity Recognition F(1) • Stanford NER: 0.44 • Ritter: 0.67 On standard texts F(1)>90% Named Entity Classification F(1) • Stanford NER: 0.29 • Ritter: 0.59 Sarah Ferguson ? http://en.wikipedia.org/wiki/Sarah,_Duchess_of_York Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Charles, Prince of Wales en.wikipedia.org/wiki/Charles,_Prince_of_Wales 85 Assigning a URI to NE Terms POS Accuracy • • Parser Accuracy • • • Ritter et al. 2011: 0.88 Ritter: 0.875 Named Entity Recognition F(1) • Stanford NER: 0.44 • Ritter: 0.67 On standard texts F(1)>90% Named Entity Classification F(1) • Stanford NER: 0.29 • Ritter: 0.59 Sarah Ferguson ? http://en.wikipedia.org/wiki/Sarah,_Duchess_of_York Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Charles, Prince of Wales en.wikipedia.org/wiki/Charles,_Prince_of_Wales black eye peas?85 Event detection Major events (e.g. Flood) have sub-events • • • River burst at location XY Identification of Sub-Events • User/Domain dictionary of potentially interesting events • Clustering of messages around bursty terms (Pohl 2012) Reliability of events • Confirmation by reliable independent sources • 200 people looting D&S Alahan's on 15 Coronation Street! • Survive the test of time • Methods of detection • See spammers in the who section © Fabio Ciravegna, University of Sheffield • 86 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Just a matter of feelings 87 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Just a matter of feelings 87 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Just a matter of feelings 87 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Just a matter of feelings 87 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Just a matter of feelings 87 Wednesday, 19 September 12 What and Post Relevance Tao et al. 2012 Many messages may refer to a specific topic/event • • Which are the really relevant ones? Ranking features • Keywords • • • Semantic features • Semantic concepts overlapping with/related to domain concepts • Diversity of semantic concepts mentioned in post • Sentiment (positive/negative) Syntactic • • Overlapping with domain description It contains an URL, a Hashtags, is a Reply, it is long Context • Social context (# followers/followees, # posts), • Temporal context (how far away from now) Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • 88 Who Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Characterising Who’s Tables from Cano 2012 User features • Content Features i) message readability; ii) topical complexity; iii) entity complexity; iv) structure complexity features - use of links to classify topics/ users (Abel 2011) Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Message Features 90 Meformers and other animals Table from Cano 2012 Adapted from Naaman 2010 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Classifying users allows: - understanding better the message content - Identifying changes in behaviour - May signal important events 91 Modelling the Social Network Semantically Enriched Communication Network (SECN) • • A formal representation of individuals and their communication exchanges • User profiles: a set of topics, weighted according to relevance to the user • Similarities between users based on their profiles A SECN is a typed, weighted graph: • Typed: nodes and edges within the graph are of several different types • Weighted: types of edges can be assigned a weight to boost importance of one type of connection or another Lanfranchi - to appear Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • 92 Spammers and Trolls Spammers: • • Send unsolicited bulk messages indiscriminately Trolls: • Someone who posts inflammatory, extraneous, or offtopic messages • Including someone spreading false rumours during emergencies • e.g. to create panic • To influence the international community (e.g. Bahrain protests) © Fabio Ciravegna, University of Sheffield • 93 Wednesday, 19 September 12 Profiling Spammers Duplicate spammers and @spammers: • • Malicious promoters • • Post nearly identical tweets with or without links sometimes addressing someone specifically (@user) Post legitimate tweets interspersed with adverts Friend infiltrators • Befriend users to receive reciprocal friendship and then spam them Kyumin Lee et al 2011 © Fabio Ciravegna, University of Sheffield • 94 Wednesday, 19 September 12 Clever spammers 23,869 spammer studied • • 18,307 not detected by Twitter after 3 months About the 18,307 “clever” spammers • Posting only 4 tweets a day • Linking to disreputable pages • • Topic not mentioned in tweets Abnormal number of followers/followees (>10,000) • • High fluctuation in followers and followees (in terms of days) Non-spammers max of 1,000 sf • Quite a stable set Kyumin Lee et al 2011 © Fabio Ciravegna, University of Sheffield • 95 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Identifying spammers Kyumin Lee et al 2011 Wednesday, 19 September 12 96 © Fabio Ciravegna, University of Sheffield Where on earth... 97 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Where on earth... 97 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Where on earth... 97 Wednesday, 19 September 12 Geo-Location Needed to ascertain: • If message is relevant (several floods are happening at the same time) • To identify the location of the issue/information reported • Can be • In metadata • • Wednesday, 19 September 12 GPS information via e.g. mobile phone In the document itself • Very detailed (“fire at 55 Surrey Street, Brighton”) • Rather generic (“Bridge opposite the Reed Deer”) • Incomplete (55 Surrey Street) • 18-40% of messages related to emergencies contain references • 78-86% of users have at least one message containing geo-information • Mainly depending on the type of hazard and phase (preparation Vs impact phase) (Vieweg 2010) © Fabio Ciravegna, University of Sheffield • 98 WHERE Location of a message Vs location of reference • The case of the Olympics: what messages come from the ground? • GPS active device ! • 320m random tweets from Cumbria (UK) from start of 2011 • • 1.5m tweets from London in 2012 • • • ~0.9% are geocoded 3.7% are geocoded Flickr has large quantities of GPS tagged pictures Hence extracting locations from post is a requirements © Fabio Ciravegna, University of Sheffield • 99 Wednesday, 19 September 12 Geo-Location Approach Lists of points of interests taken from Linked Open Data • • e.g. data.gov.uk, OpenStreetMap, etc. String metrics to identify candidate terms • Social streams tend to be highly ambiguous • • © Fabio Ciravegna, University of Sheffield • There is a London Road in every single town in England (lad also a Wall Street) Disambiguation based on context • User, document, point of interest • User locations, terms, terms related to the point of interest, related users, etc. • Approach: • Shannon’s information entropy (Ireson 2010) 100 Wednesday, 19 September 12 Reported 0 2 4 6 8 10 12 When? Wednesday, 19 September 12 14 16 © Fabio Ciravegna, University of Sheffield Event When • Time of posting • In metadata • May or may not be the time of the event (e.g. RT) Time in the message body and implications • Relevant to event but must be balanced with the linguistic form • • River Don burst into Meadowhall at half past eight River Don flooded. They say tomorrow it will stop • • I wish it stopped raining and the flood stopped as well • • The flood has not stopped Flood flood flood • Wednesday, 19 September 12 Tomorrow is not the time of the event Is it now or... © Fabio Ciravegna, University of Sheffield • 102 Post Time and Event time Timestamp is not necessarily the correct one • There is always a lag between the event and the appearance on social streams • • • Due to the need to react Retweets happen several hours after events Different events have different lag time Event Reported 0 2 4 6 8 10 12 14 16 © Fabio Ciravegna, University of Sheffield • 103 Wednesday, 19 September 12 Situation Awareness • Typically interested in the present • It happens • It has just happened • It is about to happen Focus on a timeline • • • Expectations/distant past/distant future are of limited use to put on a timeline Focus on messages with immediate temporal frame expressions • An easier task than positioning on a map everything • Heavily influenced by the time of posting Approach: pattern matching/statistical classification Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • 104 © Fabio Ciravegna, University of Sheffield Tracking Real Time Intelligence in Data Streams Wednesday, 19 September 12 TRIDS Tracking Real Time Intelligence in Data Streams • • A modular architecture enabling SA via Social Streams • During preparation, emergency and aftermath • Event identification • Multi-dimensional presentation (spatial, temporal) © Fabio Ciravegna, University of Sheffield • Back-end • Gathering, storing and processing information • Cloud Computing on a distributed multi-core archive • Data analytics based on large scale NLP technologies Front end • Visual analytics based on multi-dimensional visualisation 106 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield TRIDS architecture 107 Wednesday, 19 September 12 Processing Messages Integration of several streams into one funnelling point • • Metadata extraction • • Uniform representation of document regardless the stream it comes from Users, mentions, hashtags, keywords, GPS-coordinates Message Augmentation • User profiling (network, location, timeline, entropy, me/informer, etc.) • Content-based geo location • • © Fabio Ciravegna, University of Sheffield • Names: gazetteer based IE based on large scale geo-LoD (Ireson 2010) Trends and profiling • Points of view profiling (Cano Basave 2012) 108 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield The user in the loop Wednesday, 19 September 12 User Role • User can (must) be in the loop • Users can easily disambiguate and add context • Legal responsibility or Em Serv User must be enabled to ascertain facts and events • With the maximum of information possible • Focussing on events without losing the big picture or other important events © Fabio Ciravegna, University of Sheffield • 110 Wednesday, 19 September 12 • A tuple: <D, C, F, V, T > • D is a stream of documents d • C is a feature vector representing the set of contexts of documents in four dimensions c = < dWt, dWo, dWn, dWe > • F: function associating context to a document • V is a set of visualisation strategies characterised by dimensionbased visualisation widgets vWt, vWo, vWn, vWe • T is the definition of the user information needs, • © Fabio Ciravegna, University of Sheffield Extending the Formalism a task t T is a tuple: <v, tWt, tWo, tWn, tWe> defining the visualisation strategy and the constraints on the four dimensions • e.g. specifying a set of terms/tags/concepts (tWt), document authors, groups or types (tWo), timeframe (tWn) and geo-locations (tWe) Lanfranchi 2012 Wednesday, 19 September 12 24 111 Context-based Visual Analytics Where-based visualisation • Different levels of details with a clustering algorithm for aggregating data and displaying frequency counts • Multiple visual parameters to map different attributes • © Fabio Ciravegna, University of Sheffield • Different colours are associated with the density of information 112 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Who Wednesday, 19 September 12 Focused trends detection What-based Visualisation • Tag-cloud based visualisation to display trending topics based on specific user interests • © Fabio Ciravegna, University of Sheffield • Several dimensions to calculate trends and display them in multiple widgets ConceptCloud 114 Wednesday, 19 September 12 Temporal Trending When-based visualisation • • A time series visualisation combining column and areas plotting to provide the frequency distribution of selected social media messages. © Fabio Ciravegna, University of Sheffield • The timeline is dynamic, customisable via facets 115 Wednesday, 19 September 12 Searching in the aftermath • For • Debriefing and lesson learnt sessions, • Documents explaining actions, decisions and storing eventual evidence • Summary views and details of specific actions © Fabio Ciravegna, University of Sheffield • Hybrid search interface for mixed keyword and semantic search • Multiple spatio-temporal views over the available knowledge 116 Wednesday, 19 September 12 • Tasked to “listen for events” over 72 hours • 21st, 22nd and 23rd of February 2012 • Operators • • Three operators analysing the data streams in real time, focussing their analysis on specific events and their evolution, these analysts were working at different physical locations; • An operator working on the collected data to do post-event analysis © Fabio Ciravegna, University of Sheffield Forward Defensive 3 screens each 117 Wednesday, 19 September 12 118 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield © Fabio Ciravegna, University of Sheffield Geo Tagging 127 Wednesday, 19 September 12 • Manchester, • Event • User: UK, October 2011 spotting and riot prevention Emergency service control room • Gold Command © Fabio Ciravegna, University of Sheffield Conservative Party Conference 128 Wednesday, 19 September 12 Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Understanding -­‐ Tracking protester movements Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Understanding -­‐ Tracking protester movements Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Cri=cal informa=on – police targeted • Directly targeted individuals • Protestor networking police posi=ons and movements • Encouraged direct ac=on Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Cri=cal informa=on – police targeted Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Understanding -­‐ Rumours spreading Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Discover & understanding -­‐ Unexpected ac5vi5es Emergency Response – Ci=zen Awareness Wednesday, 19 September 12 Commercially Sensi=ve – DO NOT DISTRIBUTE -­‐ Copyright © 2011 Knowledge Now Limited Discover & understanding -­‐ Unexpected ac5vi5es Conclusions It is a messy messy world • • We are asked to address all the most difficult natural language problems at the same time • Linguistic analysis • Contextual analysis • Common sense reasoning © Fabio Ciravegna, University of Sheffield • Why not use Wikipedia, then? 135 Wednesday, 19 September 12 Conclusions (2) We can do quite well in Situation Awareness even now • • The user in the loop addresses several issues complex for the technology © Fabio Ciravegna, University of Sheffield • Huge issues in terms of privacy and legality • X: Can you follow people? • Y: Yes but it is against the law • X: This is our problem, not yours. We abide all the legislation of the countries we keep our data • 136 Wednesday, 19 September 12 Conclusions (2) We can do quite well in Situation Awareness even now • • The user in the loop addresses several issues complex for the technology © Fabio Ciravegna, University of Sheffield • Huge issues in terms of privacy and legality • X: Can you follow people? • Y: Yes but it is against the law • X: This is our problem, not yours. We abide all the legislation of the countries we keep our data • Y: • Where do you keep your data about the UK? X: Saudi Arabia 136 Wednesday, 19 September 12 f.ciravegna@shef.ac.uk http://www.dcs.shef.ac.uk/~fabio/ Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Thank You Bibliography • Elizabeth Cano Basave: Capturing And Modelling Entity Perception In Social Streams, PhD thesis, University of Sheffield, October 2012 • Mor Naaman, Jeff Boase and Chih-Hui Lai. Is it Really About Me? Message Content in Social Awareness Streams. In Proceedings, the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW 2010), February 2010, Savannah, Georgia. © Fabio Ciravegna, University of Sheffield • 138 Wednesday, 19 September 12 Your Exercise Create a program that enables situation awareness • Identify a situation awareness scenario (e.g. fire) and their associated keywords (e.g. fire brigade) • Create a Java program that: • Queries Twitter and retrieves relevant posts • Saves tweets in a database (e.g. mySql or - see later - Google FusionTables • Extract information on names by sending tweets to Zemanta.org • • • E.g. Identify references to where and who Query DbPedia.org to get further references “around the entities mentiones” • Note: you should have a clear plan on what “around” means • You will have to design a very small ontology based on the types returned by Zemanta Display tweets and related concepts in a relevant way, for example using maps for where and related concepts © Fabio Ciravegna, University of Sheffield • 139 Wednesday, 19 September 12 Example: crashes Identify the scenario (e.g. road crashes) • Query Twitter for relevant messages • Devise methods to classify relevant messages - for example use regular expressions such as • <xxx> crash(ed) … on ... © Fabio Ciravegna, University of Sheffield • 140 Wednesday, 19 September 12 Zemanta: Output Example Link: anchor=Regent Street confidence=0.862714 entity_type= Target: url=http://maps.google.com/maps?ll=51.5108,-0.1387&spn=0.01,0.01&q=51.5108,-0.1387 (Regent%20Street)&t=h title=Regent Street © Fabio Ciravegna, University of Sheffield type=geolocation 141 Wednesday, 19 September 12 Dbpedia.org • Query DBPedia.org to get relevant points of interest surrounding Regent Street • • Choose the type of points of interest (e.g. Monuments) Display on a map You can use Google Fusion Tables © Fabio Ciravegna, University of Sheffield • 142 Wednesday, 19 September 12 • The point is not to implement everything today • The point is to draw a plan and show the details on how you would do it • • So to show you understood and know how to use the tools • Required: • Architecture, Twitter query, DBpedia query • Some implementation of the workflow (At least partial) implementation (today or in the future) is required © Fabio Ciravegna, University of Sheffield The whole point is... 143 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Some techie details 144 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Twitter API Wednesday, 19 September 12 Twitter API http://net.tutsplus.com/tutorials/other/diving-into-the-twitter-api/ • There • The normal REST based API • methods constitute the core of the Twitter API, and are written by Twitter itself. It allows other developers to access and manipulate all of Twitter’s main data. • You’d use this API to do all the usual stuff you’d want to do with Twitter including retrieving statuses, updating statuses, showing a user’s timeline, sending direct messages and so on. The Search API • • The Stream API • • Lets you look beyond you and your followers. You need this API if you are looking to view trending topics and so on. lets developers sample huge amounts of real time data. Since this API is only available to approved users, we aren’t going to go over this today. Public data can be freely accessed without an API key. When requesting private data and/or user specific data, Twitter requires authentication Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • are three separate Twitter APIs actually. http://dev.twitter.com/pages/every_developer The API (ctd) are limits to how many calls and changes you can http://dev.twitter.com/pages/rate-limiting make in a day • API usage is rate limited with additional fair use limits to protect Twitter from abuse. • The API is entirely HTTP-based • Methods to retrieve data from the Twitter API require a GET request. Methods that submit, change, or destroy data require a POST. • API Methods that require a particular HTTP method will return an error if you do not make your request with the correct one. • HTTP Response Codes are meaningful • The API presently supports the following data formats: XML, JSON, and the RSS and Atom syndication formats, with some methods only accepting a subset of these formats. Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • There REST API Methods http://apiwiki.twitter.com/w/page/22554679/Twitter-API-Documentation Methods • statuses/public_timeline • statuses/home_timeline • statuses/friends_timeline • statuses/user_timeline • statuses/mentions • statuses/retweeted_by_me • statuses/retweeted_to_me • statuses/retweets_of_me • And several others!!!! Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Timeline Interacting with Twitter in Java is an unofficial Java library for the Twitter API. • You can easily integrate Java application with the Twitter service • Twitter4J is featuring: • 100% Pure Java - works on any Java Platform version 1.4.2 or later • Android platform and Google APP Engine ready • Zero dependency : No additional jars required • Built-in OAuth support • Out-of-the-box gzip support • Just download and add its jar file to the application classpath. Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Twitter4J http://twitter4j.org An Example tweets from people in Sheffield about Sheffield • People in Sheffield == geolocated in Sheffield • About Sheffield == using #Sheffield •A number of examples at https://github.com/yusuke/twitter4j/tree/master/twitter4j-examples/src/main/java/twitter4j/examples Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Get © Fabio Ciravegna, University of Sheffield package tryno.server; import twitter4j.*; import java.util.List; public class GetTweetByLocation { Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield public String getSimpleTimeLine(){ Twitter twitter = new TwitterFactory().getInstance(); String resultString= ""; try { // it creates a query and sets the geocode //requirement Query query= new Query("#sheffield"); query.setGeoCode(new GeoLocation(53.383, -1.483), 2, Query.KILOMETERS); //it fires the query QueryResult result = twitter.search(query); (ctd) Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield (ctd from previous slide) //it cycles on the tweets List<Tweet> tweets = result.getTweets(); for (Tweet tweet : tweets) { ///gets the user User user = twitter.showUser(tweet.getFromUser()); Status status= (user.isGeoEnabled())?user.getStatus():null; if (status==null) resultString+="@" + tweet.getFromUser() + " (" + user.getLocation() + ") - " + tweet.getText(); else resultString+="@" + tweet.getFromUser() + " (" + ((status! =null&&status.getGeoLocation()!=null)? status.getGeoLocation().getLatitude()+", "+status.getGeoLocation().getLongitude():user.getLocation()) + ") - " + tweet.getText(); } } catch (Exception te) { te.printStackTrace(); System.out.println("Failed to search tweets: " + te.getMessage()); System.exit(-1); } return resultString; } Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield public static void main(String[] args) { GetTweetByLocation tt= new GetTweetByLocation(); String aa=tt.getSimpleTimeLine(); System.out.println(aa); }} Wednesday, 19 September 12 Output @eatSheffield (Sheffield) - RT @barandgrillshef: #Sheffield if you had to order a cocktail what would it be, or would you just like a cup from @YorkshireTea ?@barandgrillshef (Leopold Square, Sheffield) - #Sheffield if you had to order a cocktail what would it be, or would you just like a cup from @YorkshireTea ? @Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield @Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield @barandgrillshef (Leopold Square, Sheffield) - Fancy relaxing on the beach #sheffield http://www.youtube.com/watch?v=Dax5Sbt20sA we'll see you there @barandgrillshef (Leopold Square, Sheffield) - #Sheffield #Cloudy according to the BBC http://news.bbc.co.uk/weather/forecast/353 hows your day? @barandgrillshef (Leopold Square, Sheffield) - #mothersday april 3 any plans #sheffield ? why not book a table now www.barandgrillsheffield.co.uk/mothers-day/] http:// @Kineets (sheffield) - @shefgossip what's all the factor lot doing here @katiewaissel24 checked in #sheffield an hour ago? @aryayuyutsu (53.382419,-1.478586) - RT @SheffieldStar 400 workers lose job as firm closes down in #Chesterfield http://bit.ly/hpX8NK (#Sheffield) © Fabio Ciravegna, University of Sheffield @CFMDsFMKX (Sheffield Hallam University) - We're teaching today at #sheffieldhallam #sheffield on our UG programme in #facilitiesmanagement on Managing Premises & The Work Environment @Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield @Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield @Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield @aryayuyutsu (53.382419,-1.478586) - Off for the final night of a most ROTFL-ing and LOL-ing and LMAO-ing #ComedyFestival 2011. I voted for the amazing #Thünderbards! #Sheffield @Map_Game (-12.5743, 131.102) - Where is Sheffield on the map? Play the game at http://www.map-game.com/sheffield #Sheffield Wednesday, 19 September 12 Google Fusion Tables http://www.google.com/fusiontables/public/tour/index.html • Visualize and publish your data as maps, timelines and • Host • your data tables online On your google account • Combine • data from multiple people By joining two tables Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield charts Google Fusion API • It enables programmatic access to Google Fusion Tables content. • It is an extension of Google's existing structured data capabilities for developers. Here are some of the things you can do: • Upload: populating a table in Google Fusion Tables with data, from a single row to hundreds at a time. • The data can come from a variety of sources, such as a local database, .CSV file, data collection form, or mobile device. • Query and download: it is built on top of a subset of the SQL querying language. By referencing data values in SQL-like query expressions, you can find the data you need, then download it for use by your application. Your app can do any desired processing on the data, such as computing aggregates or feeding into a visualization gadget. • Sync: As you add or change data in the tables in your offline repository, you can ensure the most up-to-date version is available to the world by synchronizing those changes up to Google Fusion Tables Tutorials at http://www.google.com/support/fusiontables/bin/answer.py?answer=184641 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield http://code.google.com/apis/fusiontables/ Creating a table © Fabio Ciravegna, University of Sheffield http://code.google.com/apis/fusiontables/docs/developers_guide.html#WritingApps Wednesday, 19 September 12 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield © Fabio Ciravegna, University of Sheffield Adding rows Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield ctd Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Geographic locations Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Expressing GeoLocations Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield The Code: FusionTablePreparation.java You must have a google account Wednesday, 19 September 12 "sql=CREATE TABLE TweetTimeLine (user: STRING, tweet: STRING, whereLoc: LOCATION "sql=INSERT INTO <tableId> (user, tweet, whereLoc) VALUES ('<user>','<the tweet text>', '<Point><coordinates><lat>,<long></coordinates></Point>') Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Creating a table/insert row They are two forms of SQL INSERT © Fabio Ciravegna, University of Sheffield runInsert: it executes all the insertion operations Wednesday, 19 September 12 Example of accessing twitters using twitter4j. Slightly modified version of the one presented during the lecture Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield GetTweetByLocation.java © Fabio Ciravegna, University of Sheffield What we want to create Wednesday, 19 September 12 The Main It creates the table with 3 columns It inserts the tweets into the table Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield insert username and password Instructions your username and password in the main of FusionTablePreparation • Run • • Got the class FusionTablePreparation This will access tweet and will generate a table in your Google Fusion Tables space at http://www.google.com/fusiontables/Home?pli=1 (you must login into google to access that) to your fusion table space • Click on “Owned by Me” • Click on the table TweetTimeLine Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield • Insert ctd will see the tweet table © Fabio Ciravegna, University of Sheffield • You • Select Wednesday, 19 September 12 Visualize -> Map © Fabio Ciravegna, University of Sheffield Result Wednesday, 19 September 12 Jena 173 ¡ A Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ¡ Jena is open source and it includes RDF API An OWL API In-memory and persistent storage SPARQL query engine Wednesday, 19 September 12 incubator.apache.org/jena/ © Fabio Ciravegna, University of Sheffield Reading and writing RDF in RDF/XML, N3 and N-Triples SPARQL in Jena SPARQL 174 Wednesday, 19 September 12 http://www.ibm.com/developerworks/xml/library/j-sparql/#2 SPARQL in Jena • SPARQL queries are created and executed with Jena via classes in the com.hp.hpl.jena.query package • QueryFactory is the simplest approach. QueryFactory has various create() methods to read a textual query from a file or from a String. • create() methods return a Query object, which encapsulates a parsed query. Then create an instance of QueryExecution, a class that represents a single execution of a query. • Call QueryExecutionFactory.create(query, model), • passing in the Query to execute and the Model to run it against. • Note: the query does not need a FROM clause. © Fabio Ciravegna, University of Sheffield • Model= Ontology Model in Jena 175 Wednesday, 19 September 12 QueryExecution There are several different execute methods on QueryExecution, each performing a different type of query • For a simple SELECT query, call execSelect(), which returns a ResultSet. • • ResultSet allows you to iterate over each QuerySolution returned by the query, providing access to each bound variable's value. Alternatively, ResultSetFormatter can be used to output query results in various formats. © Fabio Ciravegna, University of Sheffield • 176 Wednesday, 19 September 12 import com.hp.hpl.jena.query.*; public class jenaToDBpedia { public static void main(String[] args) { String queryString= "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-­‐schema#>"+ "PREFIX type: <http://dbpedia.org/class/yago/>"+ "PREFIX prop: <http://dbpedia.org/property/>"+ "SELECT DISTINCT ?country_name ?country "+ "WHERE {"+ " ?country a type:LandlockedCountries. "+ " ?country rdfs:label ?country_name. "+ " ?country prop:populationEstimate ?population ."+ " FILTER (?population > 15000000 " + " && langMatches(lang(?country_name), 'EN')) ." + "}"; jenaToEndPoint jdp= new jenaToEndPoint(); jdp.queryEndPoint(queryString, "http://dbpedia.org/sparql"); }} Again this code is simplified Never write SPARQL queries directly in the code!!! Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Querying DBpedia in Jena 177 Querying DBpedia in Jena public void queryEndPoint(String queryString, String endPoint){ // now creating query object Query query = QueryFactory.create(queryString); // initializing queryExecution factory with remote service. QueryExecution qexec = QueryExecutionFactory.sparqlService(endPoint, query); //query execution and result processing try { //simple select ResultSet results = qexec.execSelect(); ResultSetFormatter.out(System.out, results, query); } finally { qexec.close(); } } } © Fabio Ciravegna, University of Sheffield import com.hp.hpl.jena.query.*; public class jenaToEndPoint { 178 Wednesday, 19 September 12 • Interface ResultSet • Method Summary Cycle on the solutions (of type QuerySolution) http://jena.sourceforge.net/ARQ/javadoc/com/hp/hpl/jena/query/ResultSet.html Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Getting the results 179 http://jena.sourceforge.net/ARQ/javadoc/com/hp/hpl/jena/query/QuerySolution.html Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Iterate on all variables and return value as Resource or as Literal 180 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Zemanta 181 • Zemanta analyzes user-generated content (e.g. a blog post) using natural language processing and semantic search technology to suggest pictures, tags and links to related articles. • Zemanta suggests content from Wikipedia, YouTube, IMDB, Amazon.com, Crunchbase, Flickr, ITIS, Musicbrainz, Mybloglog, Myspace, NCBI, Rotten Tomatoes,Twitter, Facebook, Snooth and Wikinvest, as well as the blogs of other Zemanta users. © Fabio Ciravegna, University of Sheffield Zemanta.com 182 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Calling Zemanta 183 Wednesday, 19 September 12 © Fabio Ciravegna, University of Sheffield Getting the results 184 Wednesday, 19 September 12