Recruiting Online Volunteers for Linguistic Knowledge Acquisition
  Ed Kenschaft
  job talk, May 13, 2008, 45 minutes
  www.kenschaft.org/papers/linguistathome.html

Outline
  The internet is essentially unregulated, immense, and growing exponentially.
  Terrorist groups use the internet for recruiting and training.
  Computational linguistics subdisciplines such as opinion detection can be used to identify terrorist websites.
  Most such systems require training data, which is not readily available.
  Other research projects have had success recruiting internet volunteers for comparably difficult tasks.
  I propose to do the same with opinion labeling, and then extend to other related areas.

Challenge: Internet
  1.36 billion users (Q1 2008)
    across entire populated world and diverse language groups
    20.7% annual growth (December 2006 to December 2007)
  103,160,364 active domains (May 03, 2008)
    332,840,730 deleted domains
    648,853 new domains in past 24 hours (May 03, 2008); 619,939 (May 12, 2008)
  Source (user info): www.internetworldstats.com/ Copyright © 2008, Miniwatts Marketing Group
  Source (domain info): www.domaintools.com/internet-statistics/

Internet Users
  most users in Asia, Europe, and North America
  fastest growth in Middle East, Africa, and Latin America

Primary Languages of Internet Users
  largely European, East Asian, and Arabic
  206 million others
  fastest growth in Arabic

Internet Summary
  internet is vast, with tremendously fast growth
  most content and users are from developed nations and well-studied languages
  fastest growth is in developing nations and less-studied languages
  analysts and technology need to keep up

Challenge: Use of Internet for Global Terrorism
  several thousand terrorist websites, growing exponentially
  purposes
    propaganda -- worldwide, anonymous ...
      "The Global Islamic Call to Resistance", 1600 pages, call for self-starting terrorist cells
      "Questions and Uncertainties Concerning the Mujahideen and their Operations", doctrinal justifications
      news bulletins
      videos of American soldiers being blown up
      video statements on recent events
      video game, "Night of Bush Capturing"

Challenge: Use of Internet for Global Terrorism
  purposes [continued]
    training manuals, e.g. assassination, manufacturing poisons/explosives
      "Encyclopedia of Preparation", huge & growing online manual
    coordinate attacks between individuals or groups
      internet jihadist Irhabi007 helped plan attacks by two men from Atlanta, GA, on Washington, DC, targets
  "... networks within networks, connections within connections and links between individuals that cross local, national and international boundaries."
    Peter Clarke, head of the counter-terrorism branch of London's Metropolitan Police

Internet Terrorism Summary
  "The radicalisation process is occurring more quickly, more widely and more anonymously in the internet age, raising the likelihood of surprise attacks by unknown groups whose members and supporters may be difficult to pinpoint."
    National Intelligence Estimate, USA, 2006
  "We have to find a way to stanch the flow. The internet creates a constant reservoir of radicalised people which terrorist groups and networks can draw upon."
    Professor Bruce Hoffman, terrorism expert, Georgetown University
  How can we identify terrorist websites?

Digression: Humans and Computers are Different
  computers can do many things that humans can't do (well)
  humans can do many things that computers can't do (well)

Examples of Differences
  computers only
    find new prime numbers
    scan the entire web for "Osama bin Laden"
  humans only
    recognize emotions from facial expressions
    captcha
  both
    play chess

Crossover
  humans can impersonate computers
    long division
    find new prime numbers
  computers can impersonate humans
    Eliza: requires clever rules, limited domain
    machine learning: requires lots of data

Opinion Detection
  identify opinions and attitudes in texts (more generally, modalities)
  humans are very good at it, computers are not

Opinion Detection (Examples)
  "America is a mistake, admittedly a gigantic mistake, but a mistake nevertheless." (Sigmund Freud)
    SPEAKER DISLIKES America
  "The United States of America is a threat to world peace." (Nelson Mandela)
    SPEAKER DISLIKES United States of America

Opinion Detection (continued)
  "Mr. McGee, don't make me angry. You wouldn't like me when I'm angry." (David Banner)
    Mr. McGee SHOULDN'T make me angry
    Mr. McGee DISLIKES me when I'm angry
  "All I want for Christmas is my two front teeth." (personal communication)
    SPEAKER WANTS my two front teeth

Opinion Detection Resources
  humans can do this task well, but not fast enough
  computers are moderately successful in limited domains
    TREC 2006 & 2007
  accuracy of computers depends on availability of training data

TREC 2006 Blog (Opinion Retrieval) Track
  given a blog entry and a topic, identify whether:
    the entry is relevant to that topic
    the entry expresses an opinion on the topic
    the opinion is positive, negative, or mixed
  no training data provided
    CMU used ~10,000 training examples from movie and product reviews (Yang et al 2006)

TREC 2006 Examples
  Opinionated
    Skype 2.0 eats its young
    The elaborate press release and WSJ review while impressive don’t help mask the fact that, Skype is short on new ground breaking ideas. Personalization via avatars and ring-tones... big new idea? Not really. Phil Wolff over on Skype Journal puts it nicely when he writes, “If you’ve been using Skype, the Beta version of Skype 2.0 for Windows won’t give you a new Wow! experience.” ...
  Non-Opinionated
    Skype Launches Skype 2.0 Features Skype Video
    Skype released the beta version of Skype 2.0, the newest version of its software that allows anyone with an Internet connection to make free Internet calls. The software is designed for greater ease of use, integrated video calling, and ...

TREC 2006 Results (MAP)
                     Best      Median
  Topic relevance    42.29%    16.99%
  Opinion finding    30.04%    10.59%
  Where can we get training data?

Volunteer Projects
  Enlist online volunteers
  Provide minimal training
  Optionally, frame as a competitive game
  "The easiest part is getting the public involved. Most volunteer-computing projects can draw on tens of thousands of people with practically no advertising, relying on word of mouth. The problem is usually keeping these eager amateurs busy."
    ("Spreading the load", The Economist)

Non-computational Projects
  amateur bird-watchers track bird migrations
  amateur astronomers spot new comets

Galaxy Zoo
  roughly a million galaxies from Sloan Digital Sky Survey
  classify
    elliptical
    clockwise spiral
    anticlockwise spiral
    unclear
  identify interactions between galaxies, real or illusory

Galaxy Zoo Volunteers
  100,000+ volunteers within a few months
  30 volunteers classify each galaxy
  peak load 70,000 per hour
  final datasets
    34,617,406 analyses
    82,931 users
  filter unreliable volunteers using known test cases

Galaxy Zoo Results
  2 papers submitted for publication
  currently over 20 projects underway using resulting data
  unexpected source of error
    users are biased toward anticlockwise spirals
  future work
    phase two: more detailed questions
    phase three: more image sources
  www.galaxyzoo.org/

Stardust@home
  Problem
    aerogel sent seven years and 3 billion km through space
    identify tracks of microparticles in gel
  Volunteers
    24,000 participants
    40 million searches in under a year
  Results
    50 candidate dust particles, each identified by hundreds of participants
    featured in seven conference papers
  stardustathome.ssl.berkeley.edu/

Herbaria@home
  thousands of 19th-century plant specimens with handwritten notes
  read notes and enter information into database

Herbaria@home Volunteers
  162 volunteers, Zipfian distribution
    68 volunteers transcribed 10 or more entries
    24 volunteers transcribed 100 or more entries
    7 volunteers transcribed 1000 or more entries

Herbaria@home Results
  22,702 specimens documented (May 5, 2008)
  no redundancy
  herbariaunited.org/atHome/

Open Mind Word Expert
  word sense disambiguation
    He boarded the plane from gate 53.
    The ball is not in play until it crosses the plane.
  systems need training data

Open Mind Word Expert Results
  90,000 sense taggings over four months
  240 words, 87 examples each on average
  inter-annotator agreement: 66.56%
  66.23% precision, vs. 63.32% baseline
  best precision for words with most training examples

Volunteer Projects Summary
  projects get anywhere between 100+ and 100,000+ volunteers
  Zipfian distribution of contributions by volunteers
  What makes for a successful project?

Games
  "In every job that must be done, there is an element of fun. You find the fun, and – snap! – the job's a game." (Mary Poppins)
  9 billion human-hours of solitaire were played in 2003
  7 million human-hours to build the Empire State Building, or 6.8 hours of solitaire play in 2003
  20 million human-hours to build the Panama Canal, or about one day of solitaire play in 2003
  (Luis von Ahn, "Human Computation")

ESP Game
  Problem: label images with words/captions
  Purposes
    index images for search
    provide captions for visually impaired

ESP Game Setup
  two people, strangers
  type whatever the other player is typing
  get points whenever you agree
  timed
  only store solutions when n pairs are recorded
  taboo words from previous solutions
  random test images to catch cheaters
  symmetric verification game
    both players get same input and give same output
    each player verifies the other

ESP Game Results
  75,000 players (after one year)
  15 million agreements
  many people play over 20 hours per week
  highly accurate
  highly complete
  large part of appeal is relation with anonymous partner
  www.espgame.org/

Peekaboom
  Problem
    images with object labels, e.g. output of ESP Game
    need to locate objects in images
    used for training computer vision

Peekaboom Setup
  player A sees image
  player B has to guess object in image
  player A clicks on image, revealing small area to player B
  asymmetric verification game
    player A gets input, which player B has to guess
    player B verifies player A's analysis

Peekaboom Results
  27,000 players in first four months
  2,100,000 object locations
  many people averaged over 12 hours per day for first 10 days
  www.peekaboom.org/

Verbosity (proposed)
  Problem
    input common sense facts, e.g. "cereal is eaten with milk"
  Game
    player A sees word
    player B has to guess word
    player A gets to fill in various templates, e.g. "object is typically near ____"
    asymmetric verification game

Toolkits
  Amazon Mechanical Turk
    paid service
    requester posts task online, along with instructions and pay rate
    worker views available tasks and selects those of interest
    Examples
      examine an image and click on specified objects, $0.05 per object
      evaluate relevance of search results, $0.02 per evaluation
    www.mturk.com/

Toolkits (continued)
  Bossa
    open source, Linux
    developer provides task-specific PHP scripts
    system rates volunteer skill, evaluates agreement among volunteers
    pointer to Bolt, open source tutorial builder
    boinc.berkeley.edu/trac/wiki/BossaIntro/
  Facebook
    install customized apps
    take advantage of social networks
    www.facebook.com/

Linguist@home (a.k.a. That's Your Opinion)
  annotate sentences with opinions
  make it fun
  resources
    customer data
    server
    expert consultant(s)
  tools
    Bossa (PHP) or Java
    Facebook and standalone

Linguist@home 1-player game
  display sentence
  display list of templates
  highlight eligible participants
    determined by expert consultant
    entities and events
  allow multiple answers
    10 points for first, 20 for second, 30 for third, etc.
  How can we assure that answers are valid?

Linguist@home 2-player game
  symmetric verification
  2 players each play same game
  points for matched answers

Future Work
  extend to other linguistic subdisciplines
    e.g. topic classification
  extend to other widely used & studied languages
    e.g. German, Chinese
  extend to fastest growing languages
    e.g. Arabic
    sociopolitical factors

References
  ------. Playing or processing. The Economist. Dec 6, 2007.
  ------. Spreading the load. The Economist. Dec 6, 2007.
  ------. A world wide web of terror. The Economist. July 12, 2007.
  Amir Alexander. Aerogel: The "Frozen Smoke" that Made Stardust Possible. The Planetary Society. November 8, 2006.
  Nathaniel Ayewah, Rada Mihalcea, and Vivi Nastase.
    Building Multilingual Semantic Networks with Non-Expert Contributions over the Web. Proceedings of the KCAP 2003 Workshop on Distributed and Collaborative Knowledge Capture. Sanibel Island, Florida, November 2003.
  Timothy Chklovski. Designing interfaces for guided collection of knowledge about everyday objects from volunteers. In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI '05). San Diego, California, USA, January 10-13, 2005. ACM, New York, NY, 311-313.
  Timothy Chklovski. Using Analogy to Acquire Commonsense Knowledge from Human Contributors. MIT Artificial Intelligence Laboratory technical report AITR-2003-002, February 2003.
  Timothy Chklovski and Rada Mihalcea. Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003). Borovetz, Bulgaria, September 2003.
  Kate Land, Anze Slosar, Chris Lintott, Dan Andreescu, Steven Bamford, Phil Murray, Robert Nichol, M. Jordan Raddick, Kevin Schawinski, Alex Szalay, Daniel Thomas, Jan Van den Berg. Galaxy Zoo: The large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey. Submitted March 22, 2008.
  Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C. Nichol, Alex Szalay, Dan Andreescu, Phil Murray, Jan van den Berg. Galaxy Zoo: Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Submitted to MNRAS, April 29, 2008.

References (continued)
  Rada Mihalcea and Timothy Chklovski. Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users' Help. Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). Budapest, April 2003.
  Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, Ian Soboroff. Overview of the TREC-2006 Blog Track. TREC 2006.
  Luis von Ahn. Games With a Purpose. IEEE Computer Magazine, vol. 39, no. 6, pp. 92-94, June 2006.
  Luis von Ahn. Human Computation. Google Tech Talks. July 26, 2006.
  Luis von Ahn, Ruoran Liu and Manuel Blum. Peekaboom: A Game for Locating Objects in Images. ACM CHI 2006.
  Luis von Ahn, S. Ginosar, M. Kedia, R. Liu and M. Blum. Improving Accessibility of the Web with a Computer Game. ACM CHI 2006.
  Luis von Ahn, Mihir Kedia and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. ACM CHI 2006.
  A. J. Westphal, C. C. Allen, R. Bastien, J. Borg, F. Brenker, J. C. Bridges, D. E. Brownlee, A. L. Butterworth, C. Floss, G. J. Flynn, D. Frank, Z. Gainsforth, E. Gruen, P. Hoppe, A. T. Kearsley, H. Leroux, L. R. Nittler, S. A. Sandford, A. Simionovici, F. J. Stadermann, R. M. Stroud, P. Tsou, T. Tyliszczak, J. Warren, M. E. Zolensky. Preliminary Examination of the Interstellar Collector of Stardust. 39th Lunar and Planetary Science Conference (2008), Abstract #1855.
  Nicholos Wethington. Galaxy Zoo Gets a Makeover. Universe Today. April 23, 2008.
  Nicholos Wethington. Galaxy Zoo Results Show that the Universe Isn't 'Lopsided'. Universe Today. March 28, 2008.
  Hui Yang, Luo Si, Jamie Callan. Knowledge Transfer and Opinion Detection in the TREC2006 Blog Track. TREC 2006.
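
Backup: Linguist@home scoring sketch
  The scoring rules on the Linguist@home slides (10 points for the first answer, 20 for the second, 30 for the third; points for matched answers in the 2-player game) could be sketched roughly as below. This is only an illustration, not the planned implementation: the function names, the (holder, attitude, target) answer triple, the case-insensitive matching, and reusing the 1-player point scale for the 2-player game are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical answer format: one filled-in template as
# (holder, attitude, target), e.g. ("SPEAKER", "DISLIKES", "America").
Answer = Tuple[str, str, str]


def single_player_score(num_answers: int, step: int = 10) -> int:
    """Total points when the k-th answer earns k * step (10, 20, 30, ...)."""
    if num_answers < 0:
        raise ValueError("num_answers must be non-negative")
    return step * num_answers * (num_answers + 1) // 2


def normalize(answer: Answer) -> Answer:
    """Compare answers ignoring case and surrounding spaces (an assumption)."""
    holder, attitude, target = answer
    return (holder.strip().lower(),
            attitude.strip().lower(),
            target.strip().lower())


def matched_answers(a: Iterable[Answer], b: Iterable[Answer]) -> Set[Answer]:
    """Symmetric verification: keep only answers that both players gave."""
    return {normalize(x) for x in a} & {normalize(y) for y in b}


def two_player_score(a: Iterable[Answer], b: Iterable[Answer],
                     step: int = 10) -> int:
    """Score matched answers on the 1-player scale (assumed, not specified)."""
    return single_player_score(len(matched_answers(a, b)), step)


# Example: two players annotate the Mandela sentence from the earlier slide.
player_a = [("SPEAKER", "DISLIKES", "United States of America"),
            ("SPEAKER", "WANTS", "world peace")]
player_b = [("speaker", "dislikes", "united states of america ")]

print(single_player_score(3))                # 10 + 20 + 30 = 60
print(two_player_score(player_a, player_b))  # one match, so 10
```

  Only answers given by both players are stored or scored here, which is one way to address the question of how to assure that answers are valid; filtering with known test cases, as Galaxy Zoo does, would be a complementary check.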