Linguist@home

Recruiting Online Volunteers for
Linguistic Knowledge Acquisition





Ed Kenschaft
job talk
May 13, 2008
45 minutes
www.kenschaft.org/papers/linguistathome.html
Outline






The internet is essentially unregulated, immense,
and growing exponentially.
Terrorist groups use the internet for recruiting and training.
Computational linguistics subdisciplines such as opinion
detection can be used to identify terrorist websites.
Most such systems require training data which is not
readily available.
Other research projects have had success recruiting
internet volunteers for comparably difficult tasks.
I propose to do the same with opinion labeling, and then
extend to other related areas.
Challenge: Internet

1.36 billion users (Q1 2008)



across entire populated world and diverse language groups
20.7% annual growth (December 2006 to December 2007)
103,160,364 active domains (May 03, 2008)


332,840,730 deleted domains
648,853 new domains in past 24 hours (May 03, 2008)

619,939 (May 12, 2008)

Source (user info): www.internetworldstats.com/
Copyright © 2008, Miniwatts Marketing Group

Source (domain info): www.domaintools.com/internet-statistics/

Internet Users


most users in Asia, Europe, and North America
fastest growth in Middle East, Africa, and Latin America
Primary Languages
of Internet Users

largely European, East Asian, and Arabic


206 million others
fastest growth in Arabic
Internet Summary




internet is vast, with tremendously fast growth
most content and users are from developed nations and
well-studied languages
fastest growth is in developing nations and less-studied
languages
analysts and technology need to keep up
Challenge: Use of Internet for
Global Terrorism


several thousand terrorist websites, growing exponentially
purposes

propaganda -- worldwide, anonymous







...
"The Global Islamic Call to Resistance", 1600 pages, call for self-starting terrorist cells
"Questions and Uncertainties Concerning the Mujahideen and their
Operations", doctrinal justifications
news bulletins
videos of American soldiers being blown up
video statements on recent events
video game, "Night of Bush Capturing"
Challenge: Use of Internet for
Global Terrorism

purposes [continued]

training manuals, e.g. assassination, manufacturing
poisons/explosives
"Encyclopedia of Preparation", huge & growing online manual

coordinate attacks between individuals or groups
internet jihadist Irhabi007 helped plan attacks by two men from
Atlanta, GA, on Washington, DC, targets
"... networks within networks, connections within
connections and links between individuals that cross local,
national and international boundaries."

Peter Clarke, head of the counter-terrorism branch of London's Metropolitan Police
Internet Terrorism Summary

"The radicalisation process is occurring more quickly,
more widely and more anonymously in the internet age,
raising the likelihood of surprise attacks by unknown
groups whose members and supporters may be difficult to
pinpoint."


National Intelligence Estimate, USA, 2006
"We have to find a way to stanch the flow. The internet
creates a constant reservoir of radicalised people which
terrorist groups and networks can draw upon."

Professor Bruce Hoffman, terrorism expert, Georgetown University
How can we identify
terrorist websites?
Digression: Humans and
Computers are Different


computers can do many things that humans can't do (well)
humans can do many things that computers can't do (well)
Examples of Differences

computers only

find new prime numbers
scan the entire web for "Osama bin Laden"

humans only

recognize emotions from facial expressions
solve CAPTCHAs

both

play chess
Crossover

humans can impersonate computers



long division
find new prime numbers
computers can impersonate humans


Eliza – requires clever rules, limited domain
machine learning – requires lots of data
Opinion Detection


identify opinions and attitudes in texts (more generally,
modalities)
humans are very good at it, computers are not
Opinion Detection (Examples)

"America is a mistake, admittedly a gigantic mistake, but a
mistake nevertheless."



(Sigmund Freud)
SPEAKER DISLIKES America
"The United States of America is a threat to world peace."


(Nelson Mandela)
SPEAKER DISLIKES United States of America
Opinion Detection (continued)

"Mr. McGee, don't make me angry. You wouldn't like me
when I'm angry."




(David Banner)
Mr. McGee SHOULDN'T make me angry
Mr. McGee DISLIKES me when I'm angry
"All I want for Christmas is my two front teeth."


(personal communication)
SPEAKER WANTS my two front teeth
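The annotations on these example slides all follow a holder / attitude / target pattern. A minimal sketch of how such a label might be stored, assuming a three-field record (the class name `OpinionAnnotation` and its field names are illustrative, not part of the talk):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpinionAnnotation:
    """One labeled opinion; field names are hypothetical."""
    holder: str    # who holds the attitude, e.g. "SPEAKER"
    attitude: str  # template word, e.g. "DISLIKES", "WANTS", "SHOULDN'T"
    target: str    # what the attitude is about

    def render(self) -> str:
        # Reproduce the slide notation: HOLDER ATTITUDE target
        return f"{self.holder} {self.attitude} {self.target}"


# The examples from the slides, expressed as records.
annotations = [
    OpinionAnnotation("SPEAKER", "DISLIKES", "America"),
    OpinionAnnotation("Mr. McGee", "SHOULDN'T", "make me angry"),
    OpinionAnnotation("SPEAKER", "WANTS", "my two front teeth"),
]

for annotation in annotations:
    print(annotation.render())
```

Making the record frozen (hashable) means two volunteers' answers can later be compared with ordinary set operations.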
Opinion Detection Resources


humans can do this task well, but not fast enough
computers are moderately successful in limited domains


TREC 2006 & 2007
accuracy of computers depends on availability of training
data
TREC 2006 Blog
(Opinion Retrieval) Track

given a blog entry and a topic, identify whether:




the entry is relevant to that topic
the entry expresses an opinion on the topic
the opinion is positive, negative, or mixed
no training data provided

CMU used ~10,000 training examples from movie and
product reviews

(Yang et al. 2006)
TREC 2006 Examples

Opinionated

Skype 2.0 eats its young


The elaborate press release and WSJ review while impressive don’t
help mask the fact that, Skype is short on new ground breaking
ideas. Personalization via avatars and ring-tones... big new idea?
Not really. Phil Wolff over on Skype Journal puts it nicely when he
writes, “If you’ve been using Skype, the Beta version of Skype 2.0 for
Windows won’t give you a new Wow! experience.” ...
Non-Opinionated

Skype Launches Skype 2.0 Features Skype Video

Skype released the beta version of Skype 2.0, the newest version of
its software that allows anyone with an Internet connection to make
free Internet calls. The software is designed for greater ease of use,
integrated video calling, and ...
TREC 2006 Results (MAP)

                  Best      Median
Topic relevance   42.29%    16.99%
Opinion finding   30.04%    10.59%
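For reference, MAP (mean average precision) is the mean, over topics, of the average precision of each ranked result list. A small sketch of the standard definition follows; the toy rankings are invented for illustration and are not TREC data.

```python
def average_precision(ranked_relevance, total_relevant):
    """Average precision for one topic.

    ranked_relevance: booleans, one per retrieved item, in rank order.
    total_relevant: number of relevant items that exist for the topic.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant


def mean_average_precision(topics):
    """Mean of average precision over (ranked_relevance, total_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in topics) / len(topics)


# Two made-up topics.
toy_topics = [
    ([True, False, True, False], 2),
    ([False, True], 1),
]
map_score = mean_average_precision(toy_topics)
print(f"MAP = {map_score:.4f}")
```

Because precision is only credited at ranks that hold a relevant item, MAP rewards systems that place relevant entries near the top of the list.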
Where can we get
training data?
Volunteer Projects




Enlist online volunteers
Provide minimal training
Optionally, frame as a competitive game
"The easiest part is getting the public involved. Most
volunteer-computing projects can draw on tens of
thousands of people with practically no advertising, relying
on word of mouth. The problem is usually keeping these
eager amateurs busy."

("Spreading the load", The Economist)
Non-computational Projects


amateur bird-watchers track bird migrations
amateur astronomers spot new comets
Galaxy Zoo


roughly a million galaxies from Sloan Digital Sky Survey
classify





elliptical
clockwise spiral
anticlockwise spiral
unclear
identify interactions between galaxies, real or illusory
Galaxy Zoo Volunteers




100,000+ volunteers within a few months
30 volunteers classify each galaxy
peak load 70,000 per hour
final datasets



34,617,406 analyses
82,931 users
filter unreliable volunteers using known test cases
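One way the last point could work, sketched under assumptions: score each volunteer on galaxies whose class is already known, drop those below a cutoff, and take a majority vote among the rest. The 0.75 cutoff, the function names, and the toy data are all hypothetical; the slides do not describe Galaxy Zoo's actual procedure.

```python
from collections import Counter


def volunteer_accuracy(answers, gold):
    """Fraction of known test cases this volunteer labeled correctly."""
    scored = [item for item in answers if item in gold]
    if not scored:
        return 0.0
    return sum(answers[item] == gold[item] for item in scored) / len(scored)


def consensus_labels(all_answers, gold, min_accuracy=0.75):
    """Majority label per non-test item, counting only trusted volunteers."""
    trusted = {
        name: answers
        for name, answers in all_answers.items()
        if volunteer_accuracy(answers, gold) >= min_accuracy
    }
    votes = {}
    for answers in trusted.values():
        for item, label in answers.items():
            if item not in gold:  # test cases are not part of the output
                votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


# Toy data: g1 and g2 are known test cases, x1 is an unlabeled galaxy.
gold = {"g1": "elliptical", "g2": "clockwise spiral"}
all_answers = {
    "ann": {"g1": "elliptical", "g2": "clockwise spiral", "x1": "elliptical"},
    "bob": {"g1": "elliptical", "g2": "clockwise spiral", "x1": "elliptical"},
    "cat": {"g1": "unclear", "g2": "unclear", "x1": "unclear"},
    "dan": {"g1": "elliptical", "g2": "clockwise spiral", "x1": "anticlockwise spiral"},
}
consensus = consensus_labels(all_answers, gold)
print(consensus)
```

Here "cat" fails both test cases and is excluded, so x1 is decided by the remaining three votes.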
Galaxy Zoo Results

unexpected source of error

users are biased toward anticlockwise spirals

2 papers submitted for publication
currently over 20 projects underway using resulting data
future work

phase two: more detailed questions
phase three: more image sources
www.galaxyzoo.org/
Stardust@home

Problem

aerogel sent seven years and 3 billion km through space
identify tracks of microparticles in gel

Volunteers

24,000 participants
40 million searches in under a year

Results

50 candidate dust particles, each identified by hundreds of
participants
featured in seven conference papers
stardustathome.ssl.berkeley.edu/
Herbaria@home


thousands of 19th-century plant specimens with
handwritten notes
read notes and enter information into database
Herbaria@home Volunteers

162 volunteers, Zipfian distribution



68 volunteers transcribed 10 or more entries
24 volunteers transcribed 100 or more entries
7 volunteers transcribed 1000 or more entries
Herbaria@home Results



22702 specimens documented (May 5, 2008)
no redundancy
herbariaunited.org/atHome/
Open Mind Word Expert

word sense disambiguation



He boarded the plane from gate 53.
The ball is not in play until it crosses the plane.
systems need training data
Open Mind Word Expert
Results





90,000 sense taggings over four months
240 words, 87 examples each on average
inter-annotator agreement: 66.56%
66.23% precision, vs. 63.32% baseline
best precision for words with most training examples
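Inter-annotator agreement can be measured in several ways. The sketch below uses simple pairwise observed agreement, which may differ from the measure used in the Open Mind Word Expert papers; the sense labels "aircraft" and "surface" are invented for the two example sentences.

```python
from itertools import combinations


def pairwise_agreement(taggings):
    """Share of annotator pairs that chose the same sense for the same item.

    taggings: dict mapping an item to the list of sense labels it received.
    """
    agree = 0
    total = 0
    for labels in taggings.values():
        for first, second in combinations(labels, 2):
            total += 1
            if first == second:
                agree += 1
    return agree / total if total else 0.0


# Hypothetical taggings of "plane" by two annotators per sentence.
toy_taggings = {
    "He boarded the plane from gate 53.": ["aircraft", "aircraft"],
    "The ball is not in play until it crosses the plane.": ["surface", "aircraft"],
}
agreement = pairwise_agreement(toy_taggings)
print(f"agreement = {agreement:.2%}")
```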
Volunteer Projects Summary


projects get anywhere between 100+ and 100,000+
volunteers
Zipfian distribution of contributions by volunteers
What makes for a
successful project?
Games

"In every job that must be done, there is an element of
fun. You find the fun, and – snap! – the job's a game."


(Mary Poppins)
9 billion human-hours of solitaire were played in 2003


7 million human-hours to build the Empire State Building,
equivalent to 6.8 hours of 2003's solitaire play
20 million human-hours to build the Panama Canal,
equivalent to under one day of 2003's solitaire play

(Luis von Ahn, "Human Computation")
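The comparison is a straight ratio: divide each project's human-hours by the solitaire human-hours played per calendar hour of 2003. A quick check of the arithmetic, assuming a 365-day year and play spread evenly across it:

```python
SOLITAIRE_HOURS_2003 = 9e9   # human-hours of solitaire in 2003 (from the slide)
HOURS_PER_YEAR = 365 * 24    # calendar hours in 2003


def solitaire_equivalent_hours(project_hours):
    """Calendar hours of 2003 solitaire play that match a project's human-hours."""
    solitaire_per_calendar_hour = SOLITAIRE_HOURS_2003 / HOURS_PER_YEAR
    return project_hours / solitaire_per_calendar_hour


empire_state = solitaire_equivalent_hours(7e6)
panama_canal = solitaire_equivalent_hours(20e6)
print(f"Empire State Building: {empire_state:.1f} hours")
print(f"Panama Canal: {panama_canal / 24:.2f} days")
```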
ESP Game


Problem: label images with words/captions
Purposes


index images for search
provide captions for visually impaired
ESP Game Setup








two people, strangers
type whatever the other player is typing
get points whenever you agree
timed
store a label only after n pairs have agreed on it
taboo words from previous solutions
random test images to catch cheaters
symmetric verification game


both players get same input and give same output
each player verifies the other
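The setup above can be sketched as a small bookkeeping class per image. This is a simplified illustration, not the ESP Game's implementation: timing, scoring, and cheater detection are omitted, and the class name `EspImage`, the `threshold` parameter (the slide's n), and the sample guesses are assumptions.

```python
from collections import Counter


class EspImage:
    """Tracks agreed labels for one image across many player pairs."""

    def __init__(self, threshold=2):
        self.threshold = threshold   # n: agreeing pairs needed to store a label
        self.agreements = Counter()  # label -> number of pairs that agreed on it
        self.taboo = set()           # stored labels, off-limits in later rounds

    def play_round(self, guesses_a, guesses_b):
        """Return the first non-taboo label typed by both players, else None."""
        allowed_b = {guess.lower() for guess in guesses_b} - self.taboo
        for guess in guesses_a:
            label = guess.lower()
            if label not in self.taboo and label in allowed_b:
                self.agreements[label] += 1
                if self.agreements[label] >= self.threshold:
                    self.taboo.add(label)  # stored; future pairs must go deeper
                return label
        return None


image = EspImage(threshold=2)
first = image.play_round(["dog", "grass"], ["puppy", "dog"])
second = image.play_round(["dog"], ["animal", "dog"])
third = image.play_round(["dog", "grass"], ["grass", "dog"])
print(first, second, third, sorted(image.taboo))
```

Once "dog" has been agreed on by two pairs it becomes taboo, which pushes the third pair toward a new label and is how the game builds up multiple labels per image.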
ESP Game Results

75,000 players (after one year)


15 million agreements




many people play over 20 hours per week
highly accurate
highly complete
large part of appeal is relation with anonymous partner
www.espgame.org/
Peekaboom

Problem

images with object labels



e.g. output of ESP Game
need to locate objects in images
used for training computer vision
Peekaboom Setup




player A sees image
player B has to guess object in image
player A clicks on image, revealing small area to player B
asymmetric verification game


player A gets input, which player B has to guess
player B verifies player A's analysis
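A rough sketch of the asymmetric pattern, assuming clicks are recorded as (x, y) points and that a correct guess by player B is what validates them. The class and method names are hypothetical, and the real game's reveal radius, hints, and scoring are left out.

```python
class PeekaboomRound:
    """One round: player A reveals parts of the image, player B guesses the word."""

    def __init__(self, secret_word):
        self.secret_word = secret_word.lower()
        self.revealed = []  # (x, y) points clicked by player A

    def reveal(self, x, y):
        """Player A clicks the image, exposing a small area to player B."""
        self.revealed.append((x, y))

    def guess(self, word):
        """Player B guesses; a correct guess returns the clicks as location data."""
        if word.strip().lower() == self.secret_word:
            return list(self.revealed)
        return None


game_round = PeekaboomRound("dog")
game_round.reveal(120, 80)
wrong = game_round.guess("cat")
game_round.reveal(130, 90)
located = game_round.guess("Dog")
print(wrong, located)
```

The clicks are only kept when the guess succeeds, so player B's answer acts as the check on player A's input.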
Peekaboom Results




27,000 players in first four months
2,100,000 object locations
many people averaged over 12 hours per day
for first 10 days
www.peekaboom.org/
Verbosity (proposed)

Problem

collect common-sense facts


e.g. "cereal is eaten with milk"
Game



player A sees word
player B has to guess word
player A gets to fill in various templates


e.g. "object is typically near ____"
asymmetric verification game
Toolkits

Amazon Mechanical Turk

paid service
requester posts task online, along with instructions and pay
rate
worker views available tasks and selects those of interest

Examples

examine an image and click on specified objects,
$0.05 per object
evaluate relevance of search results, $0.02 per evaluation
www.mturk.com/
Toolkits (continued)

Bossa






open source, Linux
developer provides task-specific PHP scripts
system rates volunteer skill, evaluates agreement among
volunteers
pointer to Bolt, open source tutorial builder
boinc.berkeley.edu/trac/wiki/BossaIntro/
Facebook



install customized apps
take advantage of social networks
www.facebook.com/
Linguist@home
(a.k.a. That's Your Opinion)



annotate sentences with opinions
make it fun
resources

customer




data
server
expert consultant(s)
tools


Bossa (PHP) or Java
Facebook and standalone
Linguist@home
1-player game


display sentence
display list of templates


highlight eligible participants


determined by expert consultant
entities and events
allow multiple answers

10 points for first, 20 for second, 30 for third, etc.
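The escalating score is a simple arithmetic series. A minimal sketch, assuming the pattern continues in steps of 10 for every further answer (function names are illustrative):

```python
def answer_points(position):
    """Points for the answer in a given 1-based position: 10, 20, 30, ..."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    return 10 * position


def total_points(num_answers):
    """Total for a sentence with num_answers accepted answers."""
    return sum(answer_points(k) for k in range(1, num_answers + 1))


print(total_points(3))  # 10 + 20 + 30
```

The total grows as 10 * n * (n + 1) / 2, so each additional answer on the same sentence is worth more than the last, which rewards thorough annotation.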
How can we assure that
answers are valid?
Linguist@home
2-player game

symmetric verification


2 players each play same game
points for matched answers
Future Work

extend to other linguistic subdisciplines

e.g. topic classification

extend to other widely used & studied languages

e.g. German, Chinese

extend to fastest growing languages

e.g. Arabic
sociopolitical factors
References

------. Playing or processing. The Economist. Dec 6, 2007.

------. Spreading the load. The Economist. Dec 6, 2007.

------. A world wide web of terror. The Economist. July 12, 2007.

Amir Alexander. Aerogel: The "Frozen Smoke" that Made Stardust Possible. The Planetary Society. November 8, 2006.






Nathaniel Ayewah, Rada Mihalcea, and Vivi Nastase. Building Multilingual Semantic Networks with Non-Expert Contributions
over the Web. Proceedings of the KCAP 2003 Workshop on Distributed and Collaborative Knowledge Capture. Sanibel Island,
Florida, November 2003.
Timothy Chklovski. 2005. Designing interfaces for guided collection of knowledge about everyday objects from volunteers. In
Proceedings of the 10th international Conference on intelligent User interfaces (San Diego, California, USA, January 10 - 13,
2005). IUI '05. ACM, New York, NY, 311-313.
Timothy Chklovski, Using Analogy to Acquire Commonsense Knowledge from Human Contributors, MIT Artificial Intelligence
Laboratory technical report AITR-2003-002, February 2003.
Timothy Chklovski and Rada Mihalcea. Exploiting Agreement and Disagreement of Human Annotators for Word Sense
Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003).
Borovetz, Bulgaria, September 2003.
Kate Land, Anze Slosar, Chris Lintott, Dan Andreescu, Steven Bamford, Phil Murray, Robert Nichol, M.Jordan Raddick, Kevin
Schawinski, Alex Szalay, Daniel Thomas, Jan Van den Berg. Galaxy Zoo: The large-scale spin statistics of spiral galaxies in
the Sloan Digital Sky Survey. Submitted March 22, 2008.
Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C.
Nichol, Alex Szalay, Dan Andreescu, Phil Murray, Jan van den Berg. Galaxy Zoo: Morphologies derived from visual inspection
of galaxies from the Sloan Digital Sky Survey. Submitted to MNRAS, April 29, 2008.
References (continued)


Rada Mihalcea and Timothy Chklovski. Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users'
Help. Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). Budapest, April 2003.
Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, Ian Soboroff. Overview of the TREC-2006 Blog Track. TREC
2006.

Luis von Ahn. Games With a Purpose. IEEE Computer Magazine, vol. 39, no. 6, pp. 92-94, June 2006.

Luis von Ahn. Human Computation. Google Tech Talks. July 26, 2006.

Luis von Ahn, Ruoran Liu and Manuel Blum. Peekaboom: A Game for Locating Objects in Images. ACM CHI 2006.



Luis von Ahn, S. Ginosar, M. Kedia, R. Liu and M. Blum. Improving Accessibility of the Web with a Computer Game. ACM CHI
2006.
Luis von Ahn, Mihir Kedia and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. ACM CHI 2006.
A. J. Westphal, C. C. Allen, R. Bastien, J. Borg, F. Brenker, J. C. Bridges, D. E. Brownlee, A. L. Butterworth, C. Floss, G. J.
Flynn, D. Frank, Z. Gainsforth, E. Gruen, P. Hoppe, A. T. Kearsley, H. Leroux, L. R. Nittler, S. A. Sandford, A. Simionovici, F. J.
Stadermann, R. M. Stroud, P. Tsou, T. Tyliszczak, J. Warren, M. E. Zolensky. Preliminary Examination of the Interstellar
Collector of Stardust. 39th Lunar and Planetary Science Conference (2008), Abstract #1855.

Nicholos Wethington. Galaxy Zoo Gets a Makeover. Universe Today. April 23, 2008.

Nicholos Wethington. Galaxy Zoo Results Show that the Universe Isn't 'Lopsided'. Universe Today. March 28, 2008.

Hui Yang, Luo Si, Jamie Callan. Knowledge Transfer and Opinion Detection in the TREC2006 Blog Track. TREC 2006.