AnnoTool: Crowdsourcing for Natural Language Corpus Creation

by

Katherine Hayden

B.A., Bennington College (2011)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology, September 2013.

© Massachusetts Institute of Technology 2013. All rights reserved.

Author: Program in Media Arts and Sciences, September 10, 2013

Certified by: Catherine Havasi, Assistant Professor of Media Arts and Sciences, Thesis Supervisor

Accepted by: Patricia Maes, Associate Academic Head, Program in Media Arts and Sciences

AnnoTool: Crowdsourcing for Natural Language Corpus Creation

by

Katherine Hayden

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning on September 10, 2013, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences

Abstract

This thesis explores the extent to which untrained annotators can create annotated corpora of scientific texts. Currently the variety and quantity of annotated corpora are limited by the expense of hiring or training annotators. The expense of finding and hiring professionals increases as the task becomes more esoteric or requires a more specialized skill set. Training annotators is an investment in itself, often difficult to justify: undergraduate students or volunteers may not remain with a project long enough after being trained, and graduate students' time may already be prioritized for other research goals. As the demand increases for computer programs capable of interacting with users through natural language, producing annotated datasets with which to train these programs is becoming increasingly important.

This thesis presents an approach combining crowdsourcing with Luis von Ahn's "games with a purpose" paradigm. Crowdsourcing combines contributions from many participants in an online community. Games with a purpose incentivize voluntary contributions by providing an avenue for a task people are already motivated to do, while collecting data in the background. Here the desired data are annotations, and the target community is people who annotate text for professional or personal benefit, such as scientists, researchers, or members of the general public with an interest in science. An annotation tool was designed in the form of a Google Chrome extension built specifically to work with articles from the open-access, online scientific journal Public Library of Science (PLOS) ONE. A study was designed in which participants with no prior annotator training were given a brief introduction to the annotation tool and assigned three articles to annotate. The results of the study show considerable annotator agreement. This thesis demonstrates that crowdsourcing annotations is feasible even for technically sophisticated texts, and it presents a model of a platform that continuously gathers annotated corpora.

Thesis Supervisor: Catherine Havasi
Title: Assistant Professor of Media Arts and Sciences

AnnoTool: Crowdsourcing for Natural Language Corpus Creation

by

Katherine Hayden

The following people served as readers for this thesis:
Thesis Reader: Cynthia Breazeal, Principal Research Scientist, Media Lab

Thesis Reader: Sepandar Kamvar, Principal Research Scientist, Media Lab

Acknowledgments

My advisor, Catherine Havasi, for mentoring and inspiring me. From Principal Researcher at the MIT Media Lab to CEO of your own startup, Luminoso, you demonstrate that with unusual intelligence and unbelievable hard work, anyone can simultaneously succeed at multiple challenging roles. Thank you for your ever-illuminating feedback and smart advice.

Cynthia Breazeal and Sep Kamvar, who generously served as my readers. Your guidance was critical and kept me on the right course.

Linda Peterson, for her saintly patience and practical impatience. Thank you for your genuine care, excellent administration and heartwarming faith in me.

Joi Ito, for adopting the Media Lab two years ago and caring for its people like your extended family. You have a heart of gold.

Friends from many communities: The particularly excellent crop of human beings that made up New Grads 2011, whose brilliance and playfulness made for delightful company. The community surrounding the MIT Triathlon and Cycling teams, where I found a side of my life I had been missing and heroes in my teammates. My roommates RJ Ryan and Zack Anderson, who rubbed off on me in so many little ways, from allegiance to The One True Editor to sharper peripheral vision for live wires.

Finally, my family, who encouraged me from afar and to whom I owe my next few holidays.

Contents

1 Introduction  11
2 Background  15
  2.1 Data Quality  15
  2.2 Games With a Purpose  17
  2.3 ConceptNet  18
3 Technology  21
  3.1 Backend  21
  3.2 User Interface  22
4 Study  27
  4.1 Testers  27
  4.2 Setup  28
  4.3 Tags  29
5 Results and Conclusion  33
  5.1 Study Participants  33
  5.2 Tag Usage  33
  5.3 Time  35
  5.4 Feedback  36
  5.5 Inter-Annotator Agreement  39
  5.6 Conclusion  43
6 Future Work  45

Chapter 1

Introduction

Technology is an ongoing march towards better and more impressive iterations of current tools and applications. Computer software has followed this trend since the 1950s, when the earliest group of professional programmers first secured access to slices of time to tinker with the cutting-edge mainframe computers of the day and pioneers established the research field of Artificial Intelligence at a Dartmouth conference. Ever since, researchers in the field have been driven by questions such as, "How do we make a machine intelligent enough that a human conversing with it could mistake it not for a machine, but for a person?"
Attempts to build such machines revealed the previously underrated difficulty of engineering a system that approached anything even resembling intelligence. However, the attempt to do so has had beneficial outcomes. It has produced smarter applications, such as voice recognition software capable of recognizing spoken language and transcribing it into written text. It has also spawned new fields, such as Natural Language Processing, the goal of which is to enable computers to derive meaning from human language. The combination of these technologies produced natural language applications which take spoken user input, transcribe it to text and analyze its meaning using models of language developed through natural language processing. Although these programs' intelligence is limited to performing a specific range of tasks, they are certainly valuable tools, as evidenced by the increasing prevalence of their usage.

Apple iPhone 4s users have access to Siri [6, 52], Apple's intelligent personal assistant, which is capable of answering questions and performing actions commanded by the user, such as searching for local businesses and setting appointments on the phone's calendar. A study released in late March of 2012 found that 87% of iPhone 4s owners used Siri [11]. At that time Siri had been available for Q4 2011 and Q1 2012, in which 7.5 million and 4.3 million iPhone units were activated, respectively. That yields an estimate of 10.3 million Siri users by March 2012. Web users have access to Wolfram Alpha, an online 'knowledge engine' that computes an answer directly from the user's query, backed by a knowledge base of externally sourced, curated, structured data [65]. Wolfram Alpha can answer fact-based queries such as where and when a famous person was born. It can even decipher more complex queries such as "How old was Queen Elizabeth II in 1974?" Usage statistics for Wolfram Alpha estimate 2.6 million hits, or page visits, daily [67, 68]. In addition to these major players, there is continued investment in and development of many new natural language applications [17, 24, 36, 37, 55, 70].

However, despite the popularity of such applications, users have reported feeling disappointed by the gap between their expectations of these applications and their true performance. Some argue that as a program presents more human-like attributes, for example communicating in natural language, people become more likely to expect it to be as intelligent as a human and to get frustrated when it does not perform with real intelligence. Others argue that these applications were overly hyped, and some users are angry enough to sue when they do not perform as advertised [32, 35]. Regardless of the source of these overly high expectations, there is a common theme of disappointment in the lack of sophistication: "And for me, once the novelty wore off, what I found was that Siri is not so intelligent after all - it's simply another voice program that will obey very specific commands" [35]. People want natural language applications with more range in their knowledge of the world and more intelligence about how they can interact with it.

One effective way to create smarter natural language applications is to use supervised machine learning algorithms. These algorithms are designed to extrapolate rules from the knowledge base of texts they are given in order to apply those rules to unseen circumstances in the future [45, 46]. Supervised machine learning algorithms take annotated text as input.
This text has been augmented with metadata that helps the algorithm identify the important elements and the categories in which they should be classified. This approach is effective but suffers from the bottleneck of creating a large number of annotated corpora, or bodies of text. Annotated corpora are expensive to make, and they require someone to take the initiative to create them [8]. The result is that existing annotated corpora are concentrated in fields that are well funded or of potential commercial interest. Fields that are not well funded or seemingly lucrative lack annotated datasets. In general, this includes most scientific fields, with the exception of certain biomedical fields which are well funded and for which there have been past attempts to create natural language applications to assist doctors. Examples include programs that operate assistive machines hands-free during surgeries, expert systems which assist reasoning through possible diagnoses, and software with virtual doctors who serve as automated assistants, pre-screening patients by having them first describe the symptoms prompting their doctor visit.

This lack of annotated corpora for certain fields is problematic. On one hand it renders it nearly impossible to create sophisticated natural language applications for those fields. It also means that no natural language application can achieve a certain level of general intelligence, due to gaps in knowledge related to those unrepresented fields. Therefore it is incredibly important, if we ever wish to have applications that are more than surface-level intelligent, to have datasets across a very wide range of fields. And to do that we need to figure out how to create them much more cheaply and easily. This thesis focuses not only on increasing the number of annotated datasets for underrepresented areas, but on demonstrating a model for doing so in an automated and nearly free manner.

Chapter 2

Background

2.1 Data Quality

Snow et al., in their influential and oft-cited study "Cheap and Fast - But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks" [49], address the issue of data quality in crowdsourced annotations. They employ Amazon Mechanical Turk workers, also known as Turkers, to perform five natural language annotation tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation [2]. Their results for all five tasks demonstrate high agreement between the Turkers' non-expert annotations and gold-standard labels provided by expert annotators. Specifically, for the task of affect recognition, they demonstrate that non-expert annotations can be as effective as expert annotations. Snow et al. conclude that many natural language annotation tasks can be crowdsourced for a fraction of the price of hiring expert annotators, without having to sacrifice data quality.

In addition to establishing the viability of non-expert crowdsourced annotators, Snow et al. address further improving annotation quality by evaluating individual annotator reliability and propose a technique for bias correction. Their technique involves comparing Turker performance against gold standard examples. The Turkers are then evaluated using a voting system: Turkers who are more than 50% accurate receive positive votes, Turkers whose judgments are pure noise receive zero votes, and Turkers whose responses are anticorrelated with the gold standard receive negative votes.
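A minimal sketch of this kind of quality control can make the idea concrete. In the Python below, one function scores each worker against gold-standard items in the spirit of the voting scheme just described, and another aggregates a single label per item by majority vote, the simple aggregation strategy evaluated throughout the studies discussed in this section. The function names, the 0.5 chance level, and the choice to ignore workers without positive votes are simplifying assumptions for illustration, not the procedures of Snow et al.

```python
from collections import Counter, defaultdict

def worker_votes(worker_labels, gold):
    """Score workers against gold-standard items: better-than-chance accuracy
    earns a positive vote, chance-level (noise) earns zero, and anticorrelated
    workers earn a negative vote.  Chance is taken as 0.5 here, which assumes a
    binary task; this is an illustration, not Snow et al.'s implementation."""
    votes = {}
    for worker, labels in worker_labels.items():
        scored = [(item, label) for item, label in labels.items() if item in gold]
        if not scored:
            votes[worker] = 0
            continue
        accuracy = sum(label == gold[item] for item, label in scored) / len(scored)
        votes[worker] = 1 if accuracy > 0.5 else (-1 if accuracy < 0.5 else 0)
    return votes

def majority_vote(worker_labels, votes=None):
    """Aggregate one label per item by majority vote, optionally ignoring
    workers that did not earn a positive vote."""
    tallies = defaultdict(Counter)
    for worker, labels in worker_labels.items():
        if votes is not None and votes.get(worker, 0) <= 0:
            continue  # skip unreliable workers when reliability votes are supplied
        for item, label in labels.items():
            tallies[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in tallies.items()}

# Toy example with hypothetical worker and item names.
gold = {"s1": "joy", "s2": "anger"}
worker_labels = {
    "w1": {"s1": "joy", "s2": "anger", "s3": "fear"},
    "w2": {"s1": "joy", "s2": "joy", "s3": "fear"},
    "w3": {"s1": "anger", "s2": "joy", "s3": "sadness"},
}
votes = worker_votes(worker_labels, gold)     # {'w1': 1, 'w2': 0, 'w3': -1}
print(majority_vote(worker_labels, votes))    # {'s1': 'joy', 's2': 'anger', 's3': 'fear'}
```

In practice such a score would be computed over many gold items per worker; the sketch only illustrates the shape of the bookkeeping.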
In addition to this more in-depth approach, they note that two alternate methods of controlling for annotator reliability are simply using more workers on each task or using Amazon Mechanical Turk's payment system to reward high-performing workers and demote low-performing ones. Crowdsourced natural language annotation systems that take a human-based computation games approach preclude controlling for quality through a reward system. The implication is that quality will be controlled through a combination of annotator bias modelling and annotator quantity.

Finally, Snow et al. address training systems without a gold-standard annotation and report the surprising result that for five of the seven tasks, the average system trained with a set of non-expert annotations outperforms one trained by a single expert annotator. Nowak et al. corroborate this finding in "How Reliable are Annotations via Crowdsourcing: A Study about Inter-Annotator Agreement for Multi-Label Image Annotation," and summarize that "the majority vote applied to generate one annotation set out of several opinions, is able to filter noisy judgments of non-experts to some extent. The resulting annotation set is of comparable quality to the annotations of experts" [41].

Further work on annotation quality provides additional methods for screening data. Ipeirotis et al. argue that although many rely on data quantity and redundancy (one of the three data quality methods proposed by Snow et al.) for quality control, redundancy is not a panacea. They present techniques enabling the separation of a worker's bias and error and generate a scalar score of each worker's quality [31]. A collaboration between researchers at CrowdFlower, a crowdsourcing service like Amazon Mechanical Turk, and eBay examined worker data quality beyond a single instance in an effort to ensure continued high data quality [33]. They concluded that the best approach consists of an initial training period followed by sporadic tests, in which workers are trained against an existing gold standard. More precisely, they demonstrate that the subsequent training tasks should follow a uniform distribution. Le et al. make the analogy of training a classifier to training human workers. They point out that both need a training set, but that the distribution of that training set differs in each case. A machine classifier uses a randomly selected training set which should approximate the underlying distribution. Using the same distribution for humans biases them towards the labels with the highest prior. Therefore, training questions should introduce no bias and should be uniformly distributed across labels.

These studies on non-expert annotation have demonstrated the viability of using non-expert annotators to produce near-gold-standard annotations and have provided an array of methods, from bias correction to filtering, for further refining the quality of the annotations.

2.2 Games With a Purpose

Establishing the viability of using non-expert annotators to create quality annotated corpora is an important step toward reducing the barriers to creating such datasets. Crowdsourcing services like Amazon Mechanical Turk and CrowdFlower make it easy to find, almost instantly, any number of workers to complete tasks. However, these services cost money. The cost is nominal, often hovering around minimum wage, yet it is still an impediment to corpus-building efforts, especially when scaled to large endeavors. It would therefore be desirable to find an alternative approach to hiring annotators.
"Games with a Purpose (GWAP)," also known as human-based computation games, attempt to outsource computational tasks to humans in an inherently entertaining way. The tasks are generally easy for humans to do but difficult or impossible for computers to do. Examples include tasks which rely on human experience, ability or common sense [48, 63, 64] Luis von Ahn, a Computer Science professor at Carnegie Mellon University created the paradigm. The first GWAP developed was The ESP Game, which achieved the task of labeling images for content. The game enticed players to contribute labels by its entertaining nature; two randomly-paired online players attempt to label the image with the same words and get points only for their matches. The game is timed and provides a competitive factor, while simultaneously encouraging creativity in brainstorming appropriate labels. A more recent and complex example of a GWAP, also by von Ahn, is the website Duolingo [23, 40, 47, 61]. Duolingo provides a free platform for language learning, an inherently desirable activity for many people, while achieving text translation. Users translate article excerpts as language learning practice. Translations are corroborated across many users, and translated excerpts are combined into fully translated articles. 17 Essentially, GWAPs are appropriate in circumstances where it would be desirable to outsource a computational task to humans and where it would be possible to do this in a way that would be satisfy a human need for a service or game. A web-based tool for scientific article annotation would fall under this category; It would gather annotations to compile in a dataset valuable for training future natural language applications while providing annotators with a service to conveniently annotate their research articles and share those annotations with collaborators. Importantly, such an annotation tool would create annotated corpora without requiring continuous funding for annotators. The upfront cost of building an annotation platform and the framework for collecting and refining the gathered data is an increasingly worthwhile investment the more the tool is used and the larger and more refined the annotated corpora grows [14, 38, 50]. 2.3 ConceptNet In addition to Games with a Purpose, another method for crowdsourcing the creation of knowledge bases is by simply asking for them from online volunteers. Open Mind Common Sense (OMCS) is a project from the MIT Media Lab that recruits common sense facts about the everyday world from humans, such as "A coat is used to keep warm" or "Being stuck in traffic causes a person to be angry." OMCS provides a fill-in-the-blank structure which pairs two blanks with a connecting relationship. OMCS provides relations such as "is-a," "made-of," "motivatedcby-goal" and many more. As an example, a volunteer could input "A saxophone is used-for jazz" by selecting the "usedfor" relationship from a drop-down menu and filling in the paired elements in the blanks [51]. A follow-up project by Catherine Havasi and other MIT researchers who were involved with OMCS built a semantic network called ConceptNet [29] based on the information in the OMCS database. ConceptNet parses the natural-language assertions from OMCS, and creates a directed graph of knowledge with concepts as nodes and the relations as edges. 
ConceptNet has developed to incorporate further resources such as WordNet, Wiktionary, Wikipedia (through DBPedia), ReVerb and more, each of which requires its own parsing and integration into the ConceptNet semantic network. This structured information can be used to train machine learning algorithms. Open Mind Common Sense is similar to an online annotation tool like AnnoTool in that both are crowdsourced knowledge-acquisition platforms, and different in that OMCS relies on volunteers rather than a GWAP approach. Like OMCS, AnnoTool's annotated corpora could be another data source which ConceptNet includes in its collection of resources.

In conclusion, AnnoTool relies on gathering annotated corpora through non-expert annotators, which previous research has shown can produce high quality annotations [41, 42, 43, 44]. Specifically, its annotator base is drawn not from paid crowdsourcing services like Amazon Mechanical Turk, but from users who are already inherently motivated to engage because the system offers them an avenue for their annotation and collaboration tasks. This aligns with a Games with a Purpose approach and allows the system to grow unlimited by financial resources. Finally, similar knowledge bases, like the one gathered by the Open Mind Common Sense project, provide a concrete example of how the results can contribute to systems which train machine learning algorithms.

Chapter 3

Technology

3.1 Backend

AnnoTool was built with the web framework Django [20, 21], version 1.5.4, and a PostgreSQL database, version 9.2. Django is an open source web framework that follows the model-view-controller architecture. It is run in conjunction with an Apache server, version 2.4.6.

The model designed includes three classes: Article, User and Annotation [18]. Annotations have a many-to-one relationship with a User, meaning that each AnnoTool user has multiple annotations associated with their username. Similarly, Annotations are associated with a specific Article. If in the future AnnoTool were extended to support annotations across documents, an Annotation object would have a list of the Articles to which the instance referred. After configuring Django to connect to the PostgreSQL database, a simple syncdb command updates the database to reflect the models.

Users interact with AnnoTool as a Google Chrome extension [26, 27]. Google defines Chrome extensions as "small programs that add new features to your browser and personalize your browsing experience." Chrome extensions are built to provide extra functionality to Google's browser, Chrome, and range from programs that run in the background, without the user's interaction beyond installing them, to new toolbars or menus that appear when the user visits certain pages or clicks on an icon to open the tool. AnnoTool in its current form is built so that it opens on three specific PLOS article pages for testing purposes, but in its final form it will be allowed to appear on any PLOS ONE article page. Upon saving an annotation, the Chrome extension makes an HTTP POST request to Django to write the annotation to the PostgreSQL database [19].
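To make the data model concrete, the sketch below shows roughly what such a schema and save endpoint could look like in Django 1.5. The field names, the reuse of Django's built-in auth User, and the view are assumptions made for illustration; they are not the actual AnnoTool source.

```python
# models.py -- a minimal sketch of the Article/User/Annotation schema described
# above (Django 1.5-era style: ForeignKey expresses the many-to-one relations).
from django.contrib.auth.models import User
from django.db import models


class Article(models.Model):
    url = models.URLField(unique=True)         # the PLOS ONE article page
    title = models.CharField(max_length=500)


class Annotation(models.Model):
    user = models.ForeignKey(User)             # many annotations per user
    article = models.ForeignKey(Article)       # many annotations per article
    tag = models.CharField(max_length=50)      # e.g. "TER", "ACR", "METH"
    highlighted_text = models.TextField()
    note = models.TextField(blank=True)
    created = models.DateTimeField(auto_now_add=True)


# views.py -- a sketch of the endpoint the Chrome extension's HTTP POST could
# hit.  CSRF handling for the cross-origin request is omitted from this sketch.
from django.http import HttpResponse


def save_annotation(request):
    if request.method != "POST":
        return HttpResponse(status=405)
    Annotation.objects.create(
        user=request.user,
        article=Article.objects.get(url=request.POST["article_url"]),
        tag=request.POST["tag"],
        highlighted_text=request.POST["highlighted_text"],
        note=request.POST.get("note", ""),
    )
    return HttpResponse(status=201)
```

Running Django's syncdb command, as described above, would then create the corresponding PostgreSQL tables.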
3.2 User Interface

After the user installs AnnoTool and navigates to an article on the Public Library of Science website, a toolbox appears in the upper right corner of the page. The toolbox appears on the page of any PLOS article and automatically disappears upon leaving a PLOS article page.

In order to use the tool, the user must log in with their username and password on the Options page, accessible through the Chrome extensions list at the URL chrome://extensions. In the test version, AnnoTool usernames and passwords are assigned rather than created by the users themselves. It is not possible to create a username/password pair without Django administrator privileges [56]. In the production version of the tool, users will be able to create their own accounts and will only need to sign in once after installation. Once a new user successfully saves their login credentials in the Options page, they can return to the article and begin annotating.

A user may have one of many differing annotation approaches. One user may wish to annotate the article in multiple passes, highlighting with one tag per pass. In the backend, all tags are represented as belonging to the same radio button set. The implication is that only one tag can be selected at a time and that a tag remains selected until a different tag is chosen. What this means for "multiple pass" users is that they do not need to redundantly select the same tag for each new annotation. In the most streamlined scenario, they do not need to select a tag more than once per article. One suggestion given during user feedback was to add functionality whereby the user did not need to save after every tag. This is certainly a feature to be added in the production version of AnnoTool, further reducing the user's keystrokes to only the most necessary.

The tags are presented separated into their respective groups (Figure 3.1). The terms within groups are listed by their acronyms, while the full text is shown on mouseover. Each tag's highlight color is the border color of its radio button; when a word or phrase is highlighted on the page, this is the color used for the highlight. If a user wished to group tags differently, rename tags, change the highlighter color or add and delete tags or groups, they would navigate to the Options page to see an editable list of current groups and tags. The user's configuration is saved, so the user is able to truly personalize the tool to their annotation needs. Future versions would allow for downloading and uploading configurations, which would facilitate backing up and sharing configurations.

Beneath the two tag groups is an input box for the highlighted text. The box itself is uneditable by hand. The way a user submits the desired highlighted text into the box is through a keyboard shortcut: on a Windows operating system, the user holds down the Control key while clicking and dragging over text with the mouse, and keeps the Control key depressed until after releasing the mouse button. This transfers the highlighted text to the highlighted text input box. A Mac user would perform the same steps with one slight alteration: they would first begin highlighting a phrase with the mouse, then, while keeping the mouse button depressed, press the Control key. Finally, the user would release the mouse button before releasing Control, the same as on a Windows computer.

Underneath the input text is a larger text area for writing annotation notes. Although selecting a highlighted text string and a tag are both mandatory to save an annotation (and their absence on a save attempt will trigger an error alert), any additional notes are entirely optional.
In addition to the toolbox the user interacts with when annotating articles, AnnoTool includes an Options page where the user can add, delete, reorder and regroup tags, as well as specify a different highlighter color (Figure 3.3). In the current testing iteration, upon load a user is provided the current default tagset, consisting of two groups of tags, described thoroughly in the Study chapter.

Figure 3.1: AnnoTool Chrome extension user interface.

Figure 3.2: (A) AnnoTool sits at the top right corner of the web page. (B) As the user scrolls through the article, the toolbox scrolls alongside the article.

Figure 3.3: AnnoTool Chrome extension options screen, where groups can be created and highlighter colors set.

Chapter 4

Study

4.1 Testers

Study participants consisted of volunteers recruited through an email requesting testers and workers hired through Amazon Mechanical Turk. Amazon Mechanical Turk workers are referred to as Turkers in the Amazon Mechanical Turk documentation, on the official website and colloquially among the community. Of the eighteen total participants, eight were volunteers and ten were Turkers. One Turker's results were excluded after he made only one tag in total.

Mechanical Turk requesters locate Turkers by posting a Human Intelligence Task (HIT) which describes what the task entails, the estimated time to completion and the compensation pending a successfully completed task [3, 4]. The HIT for this study described testing a Google Chrome extension by annotating three selected science articles from the Public Library of Science.
The Turkers would be required to spend 20-30 minutes annotating each of the three articles. The overall estimated time for the HIT was 1.5 hours. Compensation was offered at $15, amounting to a $10 per hour wage. This hourly wage is higher than the average Mechanical Turk HIT, both because of the complexity of the task and in light of discussions around the ethical treatment of Turkers, highlighted for the computational linguistics community in particular by Fort et al. in their paper "Amazon Mechanical Turk: Gold Mine or Coal Mine?" [1].

4.2 Setup

Testers were presented with a consent form approved through the Committee on the Use of Humans as Experimental Subjects (COUHES). Upon agreement to participate in the study, testers were redirected to an instructions page. The instructions began with a brief explanation of the background of the study, namely that AnnoTool is an avenue for creating crowdsourced annotated corpora using PLOS articles. This was followed by a high-level overview directing users to an approximately 3-minute video uploaded to YouTube demonstrating how to install AnnoTool in Chrome, common errors in setup, and usage of the tool, along with a brief overview of the articles to annotate and the collections of tags to use when annotating each article.

Testers were recruited from MIT students and contacts as well as Mechanical Turk workers. Once testers agreed to participate, I created a username and password for them through Django command line tools, which they entered in AnnoTool's Options page to sign in. Only once signed in were testers able to save highlights to the database and have the highlights appear on their article. Other built-in checks against data corruption ensured that the tester was required to highlight a term before saving (meaning no annotation notes were unaccompanied by a highlighted term or phrase) and that a tag had to be selected to classify the highlighted text.

Three articles were chosen for testing purposes:

1. "Inflated Applicants: Attribution Errors in Performance Evaluation by Professionals" [54]
2. "Y-Chromosome Variation in Altaian Kazakhs Reveals a Common Paternal Gene Pool for Kazakhs and the Influence of Mongolian Expansions" [22]
3. "Non-Invasive Brain-to-Brain Interface (BBI): Establishing Functional Links between Two Brains" [71]

The articles were chosen on the basis of a number of factors: they belong to the "Most Viewed" category and so are more likely to appeal on average to the testers, they contain a wide variety of taggable words and phrases, and they cover a span of separate fields.

Testers were instructed to annotate each article with a different tagset. The first article was to be annotated with the "Science-specific" tagset, the second with the "Linguistic Data Consortium" tagset and the third with a single tag, the "Term (TER)" tag.

4.3 Tags

The Linguistic Data Consortium tag group is based on the extended tag group proposed by the Linguistic Data Consortium (LDC), "an open consortium of universities, companies and government research laboratories" that "creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research and development purposes" [52]. In Version 6.5 of "Simple Named Entity Guidelines" the LDC proposed a core set of tags including:

- Person (PER)
- Organization (ORG)
- Location (LOC)
- Title/Role (TTL)

Complements to this core tagset include:

- Date (DAT)
- Time (TIM)
- Money (MON)
- Percent (CNT)

which are included in the named entity tagset for the Stanford Parser.
The Science-specific tag group is an experimental set of tags created based on frequently identifiable terms and phrases observed in PLOS articles. This tagset can be refined based on usage statistics: tags that are frequently used or that have a high degree of agreement when used can be retained, while less useful ones are replaced by user-supplied tags. In the first iteration, there are seven tags:

- Term (TER)
- Acronym (ACR)
- Method (METH)
- Hypothesis (HYP)
- Theory (THY)
- Postulation (POST)
- Definition (DEF)

The first tag is "Term." Term is used to identify a word or phrase particular to an article or scientific field. These are frequently easy to spot by appearing multiple times in the article, sometimes in slightly altered forms, or by being followed by an acronym in parentheses.

The "Acronym" tag is fairly self-explanatory and is used in the normal definition of the word. Acronyms are usually introduced in parentheses following the full text for which they stand.

The "Method" tag often references a phrase or even multiple sentences. Occasionally the usage will be obvious, when the paper includes phrasing like "Our method involved..." which specifically introduces the method. A synonym for method, algorithm, also identifies phrases where the Method tag is applicable. Otherwise the method will be described as a series of steps and experiment design, but not explicitly named.

The "Hypothesis" tag can reference both the experimenter's hypothesis and various hypotheses from related or background work. Similar to the Method tag, a phrase may be obvious by explicitly containing the word hypothesis.

The "Theory" tag represents an established hypothesis. Although appropriate instances to use this tag are rarer, they are also easily recognizable, as Theory is often part of a capitalized phrase.

A postulation is something that can be assumed as the basis for argument. Therefore, the "Postulation" tag is meant to identify one or more phrases that precede the experimenters' assertion or hypothesis. This is an example of a more high-level tag.

Finally, the "Definition" tag is for a phrase where a new term is defined. Note that tags can exist within tags, and it is common to find a Term tag embedded within a Definition tag.

Chapter 5

Results and Conclusion

5.1 Study Participants

Volunteers on average annotated far less thoroughly than Turkers. Of the eight volunteers, only four completed all three articles, as opposed to all of the Turkers. Turkers also gave thorough feedback in addition to responding to an explicit request for time-to-completion estimates for each of the three separate tasks, discussed in the Feedback section below.

5.2 Tag Usage

In terms of usage, the least used tag was "Money (MON)." This is explainable by the content of the articles themselves, as very few passages discussed money or finance in general. This supports the decision to create new, science-specific tags, as it is evident that the traditional tags under the "LDC" group heading are less applicable in the context of scientific articles from PLOS. Following closely in the category of least-used tags are "Date (DAT)", "Title/Role (TTL)", "Percent (CNT)" and finally, "Theory (THY)." The first three all belong to the LDC category as well, quantitatively reinforcing the observation that the traditional tagsets are less applicable in this context, and that a tailored science-specific tagset is more relevant.
The "Theory" tag is understandably less used, both because properly taggable phrases for theory simply occur infrequently and because the tag overlaps with the "Hypothesis (HYP)" tag. At the other end of the spectrum, the most heavily used tag was "Term (TER)," followed by "Postulation (POST)", "Location (LOC)", "Method (METH)" and "Person (PER)". Terms were ubiquitous in the scientific articles, which reflects what linguists have observed about academic language [ref to Academic Lang paper]: a specialized vocabulary performs the important function of expressing abstract and complex concepts, and is thus crucial and widespread. Given the objective of presenting scientific studies and their conclusions so that they are both logically followed and accepted by the reader, explanations of logical axioms (Postulations) and step-by-step processes (Methods) account for the frequency of POST and METH tags. Although at first glance it seems curious that "Definition (DEF)" tags did not occur on par with Term's frequency, there is a plausible explanation. A term need only be defined once and is used many more times afterwards. Furthermore, although an article contains many scientific terms, as evidenced by the abundance of Term tags, most are assumed to be a standard part of the repertoire and unnecessary to define.

The tag with the highest agreement was "Acronym (ACR)." This is the simplest tag and the most easily identifiable, through a series of uppercase characters. "Percent (CNT)" was also homogeneous; although not all annotators tagged a specific percentage with the CNT tag, the ones who did were nearly unanimous in their identification of the term, without including extraneous surrounding characters. In contrast, a tag like "Method (METH)" was much less predictable in what was included. This illustrates a trend: more high-level tags, or tags that might prompt the user to include more words on average, also have more variance in the words and characters included. However, there is an identifiable "hot zone" of text that users agree upon in the general area that the tag encompasses, with less agreed-upon words on either side of the confirmed term. When creating a gold standard annotated article, the system can take the approach of only including words and characters above a certain averaged confidence level.
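To make the idea of an averaged confidence level concrete, the following sketch (an illustration, not the analysis code used in this study) counts how many annotators included each character offset in a highlight for a given tag, and keeps only the regions agreed upon by at least a chosen fraction of annotators, i.e. the "hot zones" described above.

```python
def high_confidence_spans(article_length, annotator_spans, threshold=0.5):
    """annotator_spans: one list per annotator of (start, end) character offsets
    highlighted with a given tag.  Returns the regions included by at least
    `threshold` of the annotators."""
    counts = [0] * article_length
    for spans in annotator_spans:
        for start, end in spans:
            for i in range(start, min(end, article_length)):
                counts[i] += 1

    cutoff = threshold * len(annotator_spans)
    regions, start = [], None
    for i, c in enumerate(counts + [0]):      # trailing 0 closes an open region
        if c >= cutoff and start is None:
            start = i
        elif c < cutoff and start is not None:
            regions.append((start, i))
            start = None
    return regions

# Toy example: three annotators highlight overlapping spans of a 40-character text.
spans = [[(5, 20)], [(8, 22)], [(10, 18)]]
print(high_confidence_spans(40, spans, threshold=0.6))   # [(8, 20)]
```

The agreed-upon core of the three overlapping highlights survives, while the loosely supported fringes on either side are dropped, mirroring the behavior observed for high-level tags like Method.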
5.3 Time

Of the self-reported estimates for time spent on each article, the first took on average 40 minutes, the second on average 20 minutes and the third 15 minutes. Most users found 30 minutes to be far too little time to finish tagging the first article with the Science-specific tagset, and so did not finish that article. However, four of the Mechanical Turk workers spent between 45 minutes and an hour and ten minutes on the first article before moving on. The second two articles consistently took between 10 and 30 minutes; however, it is difficult to tell whether that is because the second two tasks were simpler or because the testers were more experienced and faster after the first article. In their feedback, participants expressed that they felt more savvy even as soon as the second article, but attributed most of the time spent on the first article to it being a more difficult task than the second two: "The first one was the longest and most difficult. I spent about an hour on the first one before I decided to let it rest and move on to the second one."

Another participant elaborated that the complexity of the tags used with the first article is what made it the most difficult: "In retrospect I should have emailed for clarification on the science-specific tags. I'm kind of neurotic with things like this, so even with your description and examples for tags, a few of them (e.g., assertion, term) looked like they could be loosely applied... all over the place. I did a few passes and moved on to the other articles." He later returned to the first article again and found that the second pass through caused "Way less stress." In general, the consensus was that the first article task was the most challenging: "...it confused me. The second one was much easier." The second and third were seen as the easiest: "Those were significantly easier to work through."

In combination with the data on tag usage, it is reasonable to conclude that the first task was more involved than the other two. Comparing the first and third tasks, the first task requires use of the complete first tagset group, while the third task requires the user to use only one tag from the group. This makes the third task much simpler than the first. Similarly, the second task also uses an entire tagset, so the third task is also less complex than the second. Comparing the first and second tasks, we have tasks with two different tagsets of approximately the same number of tags (the first with seven, the second with eight). However, the second task used fewer of the tags in its group, as we saw when analyzing tag usage. Also, user feedback pointed to particular tags in the first task as causing difficulty: "I'm not gonna lie, methods and results were a bugger; entire sections seemed to apply." For the purposes of the study, it is clear that a tag like Method was an unnecessarily complicated addition. However, it stands to reason that Method is still a valuable tag for AnnoTool's regular users. Unlike study participants, they will not be instructed to tag every instance of methods in the text, only ones that interest them or serve their particular research purposes. This will serve the individual user while the system can still collect the full range of identifiable method phrases through the combination of many users' highlights.

5.4 Feedback

In addition to feedback regarding article annotation time and difficulty level, participants volunteered feedback related to the tool's user interface, its ease of use and how interesting or enjoyable the task was in general. Regarding the user interface, one tester went into detail and suggested an improvement: "Overall the add-on itself is very well designed but could benefit from things like having the definitions for annotations easily accessible on the page you are annotating. It's a functional extension that gets the ball rolling though and I didn't have any problems with it." Another user offered a similar complaint regarding the lack of easily accessible definitions, which forced one to learn them all by memory. Referring to the first two articles, which were assigned tagsets with multiple tags: "Both Articles had many annotations to tag, the length of the articles and numbers of tags to keep memory of while going through tagging was very hard to keep up with." More explicitly: "Having 8 tags to keep memory of while going through is extremely overwhelming for most people, unless the person has a second monitor to keep track of these things, and even then it would still be difficult."
The same user suggested an improvement to the study design, minimizing the number of tags: "I think keeping it to one tag is best. Assign different people different tags and you'd most likely get more accurate results quicker. If you wanted an even bigger sample, cut the article into smaller parts with maybe 1-3 tags and then assign them, making sure some of these assignments overlap so you get more accuracy because more people are doing it. I don't believe having people annotate giant research articles like shown with multiple tags to keep in your head is feasible. I think cutting it up into as small as you can and still get enough overlap to be accurate is best." It is worth noting that the intuition behind this tip aligns with studies that emphasize redundancy as a data quality measure, as explained earlier in the Background section.

Other problems included the highlights disappearing upon refreshing the page, the lack of editing capabilities for already-saved annotations, and the requirement to save after each new annotation: "The challenges: Not knowing what had already been annotated and trying to remember after having been signed out and signing back in and, of course, refreshing." "And, finally, the last challenge: not being able to un-annotate, meaning remove all or part of a selection." "Being able to redo tags or save periodically instead of every tag would be my one quality of life suggestion from the worker's POV."

There were two other surprisingly ubiquitous hitches. AnnoTool was designed with a keyboard shortcut to highlight text: instead of requiring a user to first highlight text and then click a button to select it as the highlighted text each time they wished to create an annotation, a user would simply have to hold down Control while highlighting. Although this was demonstrated in the video and written in the instructions, many users had difficulty with this initially. Similarly, although users were instructed how to save the packaged Chrome extension tool for installation using a right-click and "Save as" procedure, many expressed initial difficulty with the installation process. "So I've tried every possible way to load the AnnoTool but Chrome will not let me load it. All I get is a paper, not an actual full file. I understand that it must be dragged to the extentions to be accepted by Chrome but Chrome refuses to even allow the download." "First technical problem: Chrome not allowing the extension. It took me a while (grrrrrrrr!), but I figured out a work-around, however others may not know how to do that." "I cannot download the annotool, it wont install the extension in chrome, any tips?"

However, both of these issues are easily solved. AnnoTool is currently hosted on a private server rather than the Chrome store while it is in a testing phase. After it is downloadable through the Chrome store, users will not have to deal with any installation more complicated than selecting the program in the Chrome store. For highlighting, the user interface can add a button to select highlighted phrases for users who wish to avoid using keyboard shortcuts, or add an interactive tutorial that demonstrates usage.

One interesting observation is how many study participants continued annotating the first article past the recommended 20-30 minute range, even though the instructions expressed that it was acceptable to leave an article before finishing it, and to do so in order to budget time for the others.
Many comments reflected this: "Article 1 I spent over an hour on and I'm not even close to being done with it." "Alright. Well, I've been at this for two hours now. None of them are completely done. I think maybe I was being too much of a perfectionist with them." "I've been working on the first article for an hour and ten minutes." "Sorry I went over. That's a lot to read." "I don't know how other people did it, or are doing it, but I read every single word. I realized by the third article that maybe it wasn't really necessary to do that... but I'm also a bit of a perfectionist so what I struggled with may not be what others struggle with."

Of the annotators who did move on after 30 minutes, many felt the need to explain that they did so, regardless of the fact that the instructions stressed there was no expectation to finish an article and no penalty: "For a full job on these I believe it would take much longer than 20m." "...the assignments will need to be smaller or the time allotted/suggested will need to be higher." "Hey the first article took me 45 minutes. I read almost everything and did my best." "It took me 45 minutes to do the first one, I stopped half way through because it confused me." "I finished the annotations to the best of my ability."

In the feedback regarding article completion, two Turkers explicitly mentioned being perfectionists and another described a similar response to annotating: "I'm kind of neurotic with things like this." It would be interesting to further study the psychological traits of annotators, especially in regard to how they correlate with annotation quality and quantity. There are too many factors to tease out in this particular study, but a few points are noteworthy: many annotators did not, and perhaps could not, easily move on from an article that was unfinished. Turkers select their HITs, so this was a self-selected group. Furthermore, there was a qualification test, albeit a very simple and short one, that Turkers had to take before selecting this HIT, requiring a minor time investment on their part.

Finally, apart from the apparent stress caused by self-proclaimed perfectionism, participants seemed to enjoy the task and took pride in gaining experience and confidence: "...you could read over the articles once or twice and feel pretty confident that most of the stuff you tagged was correct. I walked away from this I had done a better job than the last two in less time." "Thanks for the opportunity to do this HIT, I enjoyed the concept of it..." "Interesting tool to use..." "Once again thanks for letting me do the work, was some interesting reading."

5.5 Inter-Annotator Agreement

There were a number of factors that worked against annotator agreement. The testers were not professional annotators and were completely inexperienced with the system and the tagset. They were introduced to the system, given a short written overview of the tags with example usage, and asked to do their best, without the option to ask questions to confirm their interpretations or request feedback. They were not professional researchers who regularly read academic papers. They were given three scientific journal articles to read and annotate, but not enough time to complete the articles. The instructions acknowledged that the time allotted per article was most likely not sufficient and told testers to simply proceed to the next article after 30 minutes.
However, many testers professed in their feedback that they had personalities which did not function well with leaving a task unfinished, and that this style of experiment caused them stress, potentially leading to poorer results. Two of the tasks involved multiple tags per article, and annotators may have had different styles which contributed to different tagging results. One approach would be to make a pass through the article once for each tag; in this case, if the annotator was short on time, some tags might not even be used. Another approach is to go through the article once, using tags as they apply. If short on time, this approach might produce a more even distribution of tag usage, but no highlighted phrases after a certain point in the article. The insufficient time allotment meant that agreement between varying approaches suffered.

Despite these many factors, there was still noticeable agreement between annotations. Let us look at the third article assigned, "Non-Invasive Brain-to-Brain Interface (BBI): Establishing Functional Links between Two Brains" [71]. The task for this article was to use a single tag called "Term" for any words or phrases which could be paper- or field-specific lingo. To study inter-annotator agreement, all highlighted terms were collected and paired with a count of how many annotators had tagged them. As annotators were instructed to tag a given term only once in an article, multiple highlights of the same term in an article did not count more than once. Looking at the table of counts of agreed-upon terms gives a sense of the high data quality; all of the highlighted phrases are reasonable instances of the Term (TER) tag.
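The tallying step can be sketched in a few lines (an illustration only, not the actual analysis script): each annotator's highlights are reduced to a set of distinct terms, so repeated highlights by the same annotator count once, and terms are then ranked by how many annotators identified them.

```python
from collections import Counter

def term_agreement(annotations):
    """annotations: (annotator_id, highlighted term) pairs for one article and
    one tag.  Returns a Counter mapping each normalized term to the number of
    distinct annotators who highlighted it at least once."""
    seen = {(annotator, term.strip().lower()) for annotator, term in annotations}
    return Counter(term for _, term in seen)

# Toy example with hypothetical annotator IDs.
counts = term_agreement([
    ("a1", "Transcranial focused ultrasound"),
    ("a1", "transcranial focused ultrasound"),   # repeat by the same annotator: counted once
    ("a2", "Transcranial focused ultrasound"),
    ("a2", "motor cortex"),
    ("a3", "motor cortex"),
])
print(counts["transcranial focused ultrasound"], counts["motor cortex"])   # 2 2
```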
A standard technique for calculating inter-annotator agreement is to use confusion matrices, and those familiar with natural language annotation might wonder why that analysis approach was not used here. Confusion matrices carry different, traditional assumptions, namely that the annotators intend to annotate thoroughly and highlight each instance of a tag. Under those assumptions, if a phrase is not highlighted, it implies that the annotator did not think it fit into a tag category. In contrast, here the assumption is that an annotator will not highlight every instance of a tag. Doing so would be redundant and provide them no extra value. Not highlighting a phrase does not imply anything: the same phrase may be highlighted elsewhere in the article, or, in real-world usage, users simply will not necessarily spend time highlighting what they already know, even if a certain tag is applicable. For example, consider an article that contains the acronyms "BCI" and "fMRI". When using AnnoTool for my own research, I might tag fMRI because I have never heard the term before or because I am specifically annotating for procedures and technologies I could use in my own experiments. I refrain from tagging BCI, not because I do not think it is an acronym, but because I am already familiar with the term or because it is not appropriate in the context of my specialized tagging task.

Number of annotators: Terms (TER)

12: Transcranial focused ultrasound
11: computer-to-brain interface
10: brain-to-computer interface
9: Deep brain stimulation; Magneto-encephalography; Transcranial magnetic stimulation
8: brain-to-brain interface; functional magnetic resonance imaging; functional transcranial doppler sonography
7: electroencephalogram; motor cortex; near infrared spectroscopy
6: epicortical stimulation; false negatives; false positives; multi-channel EEG acquisition; Neural Coupling; pulse repetition frequency; steady-state-visual-evoked-potentials
5: Brain-to-brain coupling; optically-tracked image-guidance system; parameter optimization; single-montage surface electrodes; sonication; tone burst duration; true negatives; true positives
4: accuracy index; baseline condition; caudal appendage; event-related desynchronization; ex vivo; F1-score; focused ultrasound; FUS-based CBI; implantable cortical microelectrode arrays; neuromodulation; spatial-peak pulse-average intensity; spatial-peak temporal-average intensity; standard deviation; stereotactic guidance; temporal hemodynamic patterns; Transcranial sonication of focused ultrasound
3: brain-machine-brain interface; data acquisition hardware; electroencephalographic steady-state-visual-evoked-potentials; in situ; intracortical; intraperitoneal injection; mechanical index; navigation algorithms; peripheral nervous system; Sprague-Dawley rat; steady-state visual evoked potential; stored-program architecture devices; ultrasound
2: acoustic energy; acoustic intensity; air-backed, spherical-segment, piezoelectric ultrasound transducer; astroglial systems; cavitation threshold; chemically-induced epilepsy; complex motor intentions; computer-mediated interfacing; cortical electrode array; cortical microelectrode arrays; detection accuracy; detection threshold; EEG-based BCI; electro-magnetic stimulation; electroencephalographic; external visual stimuli; extracellular neurotransmitters; function generators; functional imaging; FUS; implanted cortical electrode array; in vitro; intracortical microstimulation; linear power amplifier; Matlab codes; motor cortex neural activity; neural activity; neural electrical signals; neuromodulatory; non-invasive computer-to-brain interface; piezoelectric ultrasound transducer; pressure amplitude; sensory pathways; signal amplitude; signal fluctuations; slow calcium signaling; somatomotor; somatomotor areas; sonicated tissue; sonication parameters; spatial activation patterns; spatial environments; spatial patterns; Sprague-Dawley; square pulse; stereotactic; ultrasound frequency; visual stimulation

Figure 5.1: The Terms from the third article with inter-annotator agreement, grouped by the number of annotators who identified the term. The collection of highlighted phrases identified with the Term tag reveals a high quality collection.

Figure 5.2: The distribution of agreed-upon Terms in the third article.

5.6 Conclusion

This study has shown the viability of using untrained, non-expert annotators to create annotated corpora using scientific texts.
Researchers in crowdsourced natural language annotation studies have demonstrated that non-expert annotators can produce high quality annotations using non-technical text; this study showed that it is also possible with highly technical text from scientific journal articles. The study design ensured a highly conservative assessment of inter-annotator agreement, yet it produced results of excellent quality. It stands to reason that data quality will improve further with future versions of AnnoTool, as a result of improvements to the user interface and the removal of the study constraints that worked against annotator agreement. Additionally, acquiring an annotator base with experience using AnnoTool will remove practical obstacles to annotating faster and more thoroughly.

Chapter 6

Future Work

The study succeeded in showing that it is possible to use untrained annotators to annotate technical scientific articles. It is reasonable to assume that the self-selected group of users outside of the study scenario (those users likely to read and annotate scientific articles) would produce annotations at least as good or better. It is also likely that improvements to the user interface would support better annotations. Explicit feedback from the users revealed the most important and necessary user interface improvements. Users need to be able to edit and delete annotations they made by accident or would like to alter. Users would appreciate a list of the annotations recently made, especially if it could be filtered by tag. In its current iteration, annotations are cleared from the page (although not from the database) when the page is refreshed. Future versions of AnnoTool would reload all previous annotations and highlight the page accordingly.

What will be interesting to see next is whether the tool gains a user base after being released in the Chrome store. Although there are many options for simple annotation tools, the outspoken community around one tool in particular, Mendeley, reveals the strong desire for an open source annotation application specifically built for a scientific community. Mendeley is a program for organizing, annotating and sharing scientific research papers, available in both desktop and web form. Ironically, its passionate user base was most apparent when vocally boycotting the program after its acquisition by Elsevier, a scientific publishing giant with a reputation for implementing and advocating for restrictive publishing practices. Among many others, Danah Boyd, a Senior Researcher at Microsoft Research, expressed her strong stance through Twitter and blog posts, explaining, "I posted the announcement on Twitter to state that I would be quitting Mendeley. . . I was trying to highlight that, while I respected the Mendeley team's decision to do what's best for them, I could not support them as a customer knowing that this would empower a company that I think undermines scholarship, scholars, and the future of research" [9]. The community outcry gave insight into how deeply open access is valued. It is so fundamentally important that users were willing to relinquish a tool they had grown to feel very passionately about, even though that meant porting all data stored in Mendeley, learning a new system, and reestablishing their social research network in the new environment, if that was even offered as a feature [30]. Currently the closest open access alternative to Mendeley is Zotero [72], a free and open source reference management program.
Alternatively, the tool could be outfitted with the capability to design and save a particular configuration of tags and send this configuration to individuals hired to annotate chosen articles. For the study, the tool used a predetermined set of tags, but it has the ability to incorporate user-added tags. This is one of its potential strengths, but it also adds complexity during analysis.

User-added tags allow users to customize the tool to their annotation needs. One instance where this could be useful is when a user is annotating for a unique project, for example tagging lines in articles that reference the founders of their fields, for a project compiling historical reflection in scientific papers. The user-created tags for this project might include "Reference to Single Main Founder", "Reference to Multiple Founders" and "Disputed Origin". These tags are extremely specific to that user's project and probably not useful to most readers. Allowing individual users to create their own tags makes the tool useful in a wider range of contexts and thus more likely to be used. At the same time, it simplifies presentation and usage, because the tagbase does not have to be cluttered with every tag a user might conceivably want.

The flip side of allowing users to create tags is the additional complexity it adds when analyzing inter-annotator agreement. The system is then tasked with recognizing and grouping user-added tags that perform similar functions. At its most basic, this could work as follows. The system keeps track of all highlighted phrases in a given article and identifies clusters of highlighted phrases that most likely refer to the same phrase. For each phrase cluster, it weights each possible interpretation of the phrase through a combination of factors, such as the number of times that interpretation was chosen by users and an average of those users' "reputation" scores. A user's reputation score, in turn, is bolstered whenever their interpretation is chosen and, to a lesser extent, whenever they create new tags. Once an interpretation for a phrase cluster is decided upon, it becomes the accepted interpretation, and the count attributed to it includes all of the possible interpretations in the phrase cluster. Finally, all accepted interpretations are ranked by their counts; those with counts above a particular threshold, or in the top x percent, could be designated high-confidence highlights. One use for these high-confidence highlights is to display a crowdsourced highlighted article; they are also what populate the annotated corpus.

This is a high-level walkthrough. Some of the finer details to work out include discerning clusters correctly, without incorrectly merging disparate clusters. AnnoTool allows highlighted phrases within longer highlighted phrases, for instance, which might appear to be variations of the same cluster. One hint for discerning these would be taking into account differing tags and how likely those tags are to have similar meanings, which in turn could be determined by how often they refer to the same highlighted text.
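To make the scoring idea concrete, here is a minimal sketch in Python, assuming hypothetical data structures: each cluster is a mapping from an interpretation (a tag name) to the set of user ids who chose it, and reputation scores live in a plain dictionary. The exact way vote counts and reputation are combined below (votes scaled by average reputation) is an illustrative assumption, not a description of an implemented system.

    def accept_interpretation(cluster, reputation):
        """Pick the accepted interpretation for one cluster of highlights.

        `cluster` maps an interpretation (e.g. a tag name) to the set of user ids
        who chose it; `reputation` maps user id -> reputation score.
        Returns (accepted_tag, total_count), where total_count covers the whole cluster.
        """
        def weight(users):
            # Combine vote count with the voters' average reputation (assumed formula).
            avg_rep = sum(reputation.get(u, 1.0) for u in users) / len(users)
            return len(users) * avg_rep

        accepted = max(cluster, key=lambda tag: weight(cluster[tag]))
        # The accepted interpretation inherits the counts of every interpretation
        # in the cluster, as described above.
        total = sum(len(users) for users in cluster.values())
        return accepted, total

    def high_confidence(clusters, reputation, threshold=5):
        """Rank accepted interpretations by count and keep those above a threshold."""
        accepted = [accept_interpretation(c, reputation) for c in clusters]
        return [(tag, n)
                for tag, n in sorted(accepted, key=lambda p: p[1], reverse=True)
                if n >= threshold]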
As an article reaches a certain level of maturation, in that it has been annotated thoroughly enough that new user annotations are unlikely to alter its annotation calculations, it can be "locked" and considered a gold-standard article. This becomes useful when giving feedback to users on their own annotation style. For instance, consider the case where a user creates a tag called "Outcome" and tags phrases in a way that aligns with the article's tagged "Result" phrases. The system can inform the user that they use Outcome in the same way that many use Result and can offer to combine and rename all of the user's Outcome tags. Even if the user rejects this proposed change, the system can internally create a user-specific translation recording that Outcome is equivalent to Result.

System feedback actively given to the user could serve as a form of annotator training, in addition to the more passive education annotators can acquire on their own by viewing examples of other users' annotations. One unknown is whether annotator quality will improve with time and with training. Akkaya et al. investigated the learning effect for simple annotation tasks and found that annotator reliability does not improve with time. However, they propose that a follow-up study involving complex annotation tasks could reveal the opposite, particularly if workers are allowed to see others' answers. Future versions of AnnoTool would be in a position to study this.

So far the only user benefits of AnnoTool that have been discussed are those that resemble a more or less traditional annotation program. Users can highlight and add notes to articles. They can customize their settings, including adding or deleting tags, changing the highlight colors for tags, and grouping tags on the UI in the arrangement that makes sense to them. Unlike many annotation programs, the tool is browser-based and stores everything in the cloud. This design choice enables the features that truly distinguish AnnoTool from an annotator's perspective. Annotators can benefit from all AnnoTool data. They can view annotations by a specific user. They can go to an article that has been annotated by others and retrieve the annotations on that article that have reached a certain level of consensus. In the early stages of an article's annotation history, when consensus is still forming, they can upvote or downvote competing annotations. As the database of annotated articles grows, they can perform more detailed searches across articles. For instance, a user could request a collection of all new articles in a particular sub-field since their last query to the system. They could further refine this query to return only the names of the articles, their authors, and the phrases from the articles highlighted with the Result tag.

Further additions to the system could involve more automated textual analysis. For example, the system could take in a query like "Cause of obesity | Result" and return articles discussing the cause of obesity in phrases tagged with Result, this time grouped into clusters based on the similar content of their Result tags. The user could then see one group of articles discussing primarily genetic factors in the Result-tagged phrases, another focusing on shared psychological traits, and another discussing socioeconomic status.
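As a concrete illustration, the sketch below shows one way such a tag-filtered query might be resolved. The pipe-syntax parsing, the annotation record fields ("article", "tag", "phrase"), and the naive keyword match are all hypothetical stand-ins rather than an existing AnnoTool API; the content-based clustering of Result phrases described above would sit on top of this retrieval step.

    def run_query(query, annotations):
        """Resolve a query like "Cause of obesity | Result" against stored annotations.

        `annotations` is assumed to be an iterable of dicts with the keys
        "article", "tag", and "phrase". The text before the pipe is treated as a
        keyword filter; the text after it names the tag whose phrases are returned.
        """
        keywords_part, _, tag = (part.strip() for part in query.partition("|"))
        # Skip very short words ("of", "a") in this naive keyword filter.
        keywords = [w for w in keywords_part.lower().split() if len(w) > 2]

        results = {}
        for ann in annotations:
            if ann["tag"] != tag:
                continue
            # A real system would use semantic matching; plain substring search
            # stands in for it here.
            if any(word in ann["phrase"].lower() for word in keywords):
                results.setdefault(ann["article"], []).append(ann["phrase"])
        return results   # article -> matching phrases carrying the requested tag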
As the system amassed a larger knowledge base and grew increasingly sophisticated, it could, for example, identify that an article discussing neuroticism as a cause of obesity should be grouped with the "psychological traits" cluster, since neuroticism is a psychological trait. The background knowledge of this "is a" relationship could be developed through user-tagged relationships or by partnering with databases such as Open Mind Common Sense.

References

1. Adda, G., & Cohen, K. B. (2011). Last Words: Amazon Mechanical Turk: Gold Mine or Coal Mine?
2. Akkaya, C., Conrad, A., Wiebe, J., & Mihalcea, R. (2010). Amazon Mechanical Turk for subjectivity word sense disambiguation, 195-203. Retrieved from http://dl.acm.org/citation.cfm?id=1866696.1866727
3. Amazon. (n.d.-a). Amazon Mechanical Turk Command Line Tools: Developer Tools: Amazon Web Services. Retrieved September 27, 2013, from https://aws.amazon.com/developertools/Amazon-Mechanical-Turk/694
4. Amazon. (n.d.-b). Amazon Mechanical Turk Requester. Retrieved September 27, 2013, from https://requester.mturk.com/
5. AnaWiki. (n.d.). Phrase Detectives - The AnaWiki annotation game. Retrieved September 27, 2013, from http://anawiki.essex.ac.uk/phrasedetectives/
6. Apple. (n.d.). Apple Siri. Apple.com. Retrieved September 27, 2013, from http://www.apple.com/iphone/features/siri.html
7. Basile, V., Bos, J., Evang, K., & Venhuizen, N. (2012a). A platform for collaborative semantic annotation, 92-96. Retrieved from http://dl.acm.org/citation.cfm?id=2380921.2380940
8. Basile, V., Bos, J., Evang, K., & Venhuizen, N. (2012b). Developing a large semantically annotated corpus. LREC, 3196-3200. Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/pdf/534_Paper.pdf
9. Boyd, D. (2013). danah boyd | apophenia: why I'm quitting Mendeley (and why my employer has nothing to do with it). Retrieved September 27, 2013, from http://www.zephoria.org/thoughts/archives/2013/04/11/mendeley-elsevier.html
10. Callison-Burch, C., & Dredze, M. (2010). Creating Speech and Language Data With Amazon's Mechanical Turk. Human Language Technologies Conference, 1-12. Retrieved from https://en.wikipedia.org/wiki/Crowdsourcing
11. Campbell, M. (2012). Siri used by 87% of iPhone 4S owners, study claims. AppleInsider. Retrieved September 27, 2013, from http://appleinsider.com/articles/12/03/26/siri_used_by_87_of_iphone_4s_owners_study_claims
12. Carpenter, B. (2008). Dolores Labs' Text Entailment Data from Amazon Mechanical Turk | LingPipe Blog on WordPress.com. LingPipe Blog. Retrieved September 27, 2013, from http://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk
13. Carpenter, B. (2009). Phrase Detectives Linguistic Annotation Game | LingPipe Blog on WordPress.com. LingPipe Blog. Retrieved September 27, 2013, from http://lingpipe-blog.com/2009/11/10/phrase-detectives-linguistic-annotation-game/
14. Carpenter, B. (2012a). Mavandadi et al. (2012) Distributed Medical Image Analysis and Diagnosis through Crowd-Sourced Games: A Malaria Case Study | LingPipe Blog on WordPress.com. LingPipe Blog. Retrieved September 27, 2013, from http://lingpipe-blog.com/2012/05/05/mavandadi-2012-distributed-medical-image-analysi/
15. Carpenter, B. (2012b). Another Linguistic Corpus Collection Game | LingPipe Blog on WordPress.com. LingPipe Blog. Retrieved September 27, 2013, from http://lingpipe-blog.com/2012/11/12/another-linguistic-corpus-collection-game/
16. Carpenter, B. (2013). VerbCorner: Another Game with Words | LingPipe Blog on WordPress.com. LingPipe Blog. Retrieved September 27, 2013, from http://lingpipe-blog.com/2013/07/04/verbcorner-another-game-with-words/
17. Cutler, K.-M. (2013). Peter Thiel's Breakout Labs Funds "Nanostraws" And A Siri-Like Natural Language Processing Startup | TechCrunch. TechCrunch. Retrieved September 27, 2013, from http://techcrunch.com/2013/04/17/breakout-labs-skyphrase-stealth-biosciences/
18. Django. (n.d.-a). Django Models. Retrieved September 27, 2013, from https://docs.djangoproject.com/en/1.5/topics/db/models/
19. Django. (n.d.-b). Django URL dispatcher. Retrieved September 27, 2013, from https://docs.djangoproject.com/en/1.5/topics/http/urls/
20. Django. (n.d.-c). Django Getting started. Retrieved September 27, 2013, from https://docs.djangoproject.com/en/1.5/intro/
21. Django. (n.d.-d). Django documentation. Retrieved September 27, 2013, from https://docs.djangoproject.com/en/1.5/
22. Dulik, M. C., Osipova, L. P., & Schurr, T. G. (2011). Y-chromosome variation in Altaian Kazakhs reveals a common paternal gene pool for Kazakhs and the influence of Mongolian expansions. PLoS ONE, 6(3), e17548. doi:10.1371/journal.pone.0017548
23. Duolingo. (n.d.). Duolingo. Retrieved September 27, 2013, from http://www.duolingo.com/
24. Etherington, D. (2012). Maluuba Launches Natural Language Processing API, Brings Siri-Like Powers To Any App | TechCrunch. TechCrunch. Retrieved September 27, 2013, from http://techcrunch.com/2012/11/14/maluuba-launches-natural-language-processing-api-brings-siri-like-powers-to-any-app/
25. Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2), 413-420. doi:10.1162/COLI_a_00057
26. Google. (n.d.). Chrome Extension Overview. Retrieved September 27, 2013, from http://developer.chrome.com/extensions/overview.html
27. Google. (2013). Getting Started: Building a Chrome Extension. Retrieved September 27, 2013, from http://developer.chrome.com/extensions/getstarted.html
28. Graham, M. (2011). Wiki Space: Palimpsests and the Politics of Exclusion. In G. W. Lovink & N. Tkacz (Eds.), Critical Point of View: A Wikipedia Reader (pp. 269-282). Institute of Network Cultures.
29. Havasi, C., Speer, R., & Alonso, J. (2007). ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge. Recent Advances in Natural Language Processing. Retrieved from http://www.media.mit.edu/~jalonso/cnet3.pdf
30. Ingram, M. (2013). The Empire acquires the rebel alliance: Mendeley users revolt against Elsevier takeover - paidContent. paidContent. Retrieved September 27, 2013, from http://paidcontent.org/2013/04/09/the-empire-acquires-the-rebel-alliance-mendeley-users-revolt-against-elsevier-takeover/
31. Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation - HCOMP '10 (p. 64). New York, NY, USA: ACM Press. doi:10.1145/1837885.1837906
32. Kelly, M. (2012). Were Apple's Siri ads "false and misleading"? The Washington Post. Retrieved from http://www.washingtonpost.com/business/technology/were-apples-siri-ads-false-and-misleading/2012/03/13/gIQAtBBWGS_story.html?tid=pm-businesspop
33. Le, J., & Edmonds, A. (2010). Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. Crowdsourcing for Search Evaluation (CSE), 17-20. Retrieved from http://ir.ischool.utexas.edu/cse2010/materials/leetal.pdf
34. Less, F., & Taught, C. (2006). Simple Named Entity Guidelines, 1-15.
35. Lowensohn, J. (2012). Apple's Siri not as smart as she looks, lawsuit charges. CNET. Retrieved March 12, 2012, from http://news.cnet.com/8301-27076_3-57395727-248/apples-siri-not-as-smart-as-she-looks-lawsuit-charges/
36. Lunden, I. (2013). Wavii Confirms Google Buy, Shuts Down Its Service To Make Natural Language Products For The Search Giant | TechCrunch. TechCrunch. Retrieved September 27, 2013, from http://techcrunch.com/2013/04/26/wavii-confirms-google-buy-shuts-down-its-service-to-make-natural-language-products-for-the-search-giant/
37. Maluuba. (n.d.). Natural Language Processing Technology | Maluuba. Retrieved September 27, 2013, from http://www.maluuba.com/
38. Mavandadi, S., Dimitrov, S., Feng, S., Yu, F., Sikora, U., Yaglidere, O., ... Ozcan, A. (2012). Distributed medical image analysis and diagnosis through crowd-sourced games: a malaria case study. PLoS ONE, 7(5), e37245. doi:10.1371/journal.pone.0037245
39. McCandless, M. (2011, May). Chromium Compact Language Detector. Retrieved from http://code.google.com/p/chromium-compact-language-detector/
40. Mims, C. (2011). Translating the Web While You Learn. MIT Technology Review. Retrieved September 27, 2013, from http://www.technologyreview.com/computing/37487/
41. Nowak, S., & Rüger, S. (2010). How reliable are annotations via crowdsourcing. In Proceedings of the international conference on Multimedia information retrieval - MIR '10 (p. 557). New York, NY, USA: ACM Press. doi:10.1145/1743384.1743478
42. O'Connor, B. (2008a). Wisdom of small crowds, part 2: individual workloads and rates | The CrowdFlower Blog. CrowdFlower. Retrieved September 27, 2013, from https://crowdflower.com/blog/2008/08/wisdom-of-small-crowds-part-2-individual-workloads-and-rates/
43. O'Connor, B. (2008b). AMT is fast, cheap, and good for machine learning data | The CrowdFlower Blog. CrowdFlower. Retrieved September 27, 2013, from https://crowdflower.com/blog/2008/09/amt-fast-cheap-good-machine-learning/
44. O'Connor, B. (2008c). Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick) | The CrowdFlower Blog. CrowdFlower. Retrieved September 27, 2013, from https://crowdflower.com/blog/2008/06/aggregate-turker-judgments-threshold-calibration/
45. Pustejovsky, J., & Stubbs, A. (2012). Natural Language Annotation for Machine Learning (p. 342). O'Reilly Media. Retrieved from http://www.amazon.com/Natural-Language-Annotation-Machine-Learning/dp/1449306667
46. Pustejovsky, J., & Stubbs, A. (2013). Interview on Natural Language Annotation for Machine Learning. O'Reilly Media. Retrieved September 27, 2013, from https://www.youtube.com/watch?v=-bOUNKzawg
47. Siegler, M. (2011). Meet Duolingo, Google's Next Acquisition Target; Learn A Language, Help The Web. TechCrunch. Retrieved September 27, 2013, from http://techcrunch.com/2011/04/12/duolingo/
48. Siorpaes, K., & Hepp, M. (2008). Games with a Purpose for the Semantic Web. IEEE Intelligent Systems, 23(3), 50-60. doi:10.1109/MIS.2008.45
49. Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks, 254-263. Retrieved from http://dl.acm.org/citation.cfm?id=1613715.1613751
50. Speer, R., Havasi, C., & Surana, H. (2010). Using Verbosity: Common Sense Data from Games with a Purpose. FLAIRS, 104-109.
51. Speer, R., & Lieberman, H. (2008). AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge, 548-553.
52. Strassel, S. (2003). Simple Named Entity Guidelines. Linguistic Data Consortium. Retrieved September 27, 2013, from http://projects.ldc.upenn.edu/SurpriseLanguage/Annotation/NE/index.html
53. Sung, D. (2011). What is Siri? Apple's iPhone 4S assistant explained. Pocket-lint. Retrieved September 27, 2013, from http://www.pocket-lint.com/news/42420/what-is-siri-iphone-4s
54. Swift, S. A., Moore, D. A., Sharek, Z. S., & Gino, F. (2013). Inflated applicants: attribution errors in performance evaluation by professionals. PLoS ONE, 8(7), e69258. doi:10.1371/journal.pone.0069258
55. Swisher, K. (2013). Yahoo Acquires Hipster Mobile News Reader Summly (Like We Said) - Kara Swisher - Mobile - AllThingsD. AllThingsD. Retrieved September 27, 2013, from http://allthingsd.com/20130325/yahoo-acquires-hipster-mobile-news-reader-summly-like-we-said-it-might/
56. Tastypie. (n.d.). Getting Started with Tastypie. Retrieved September 27, 2013, from http://django-tastypie.readthedocs.org/en/latest/tutorial.html
57. The Mendeley Support Team. (2011a). Getting Started with Mendeley. Mendeley Desktop. London: Mendeley Ltd. Retrieved from http://www.mendeley.com
58. The Mendeley Support Team. (2011b). Getting Started with Mendeley. Mendeley Desktop. London: Mendeley Ltd. Retrieved from http://www.mendeley.com
59. The Mendeley Support Team. (2011c). Getting Started with Mendeley. Mendeley Desktop. London: Mendeley Ltd. Retrieved from http://www.mendeley.com
60. Venhuizen, N., & Basile, V. (2013). Gamification for word sense labeling. Proc. 10th International Conference on Computational Semantics (IWCS). Retrieved from http://www.newdesign.aclweb.org/anthology/W/W13/W13-0215.pdf
61. Vesselinov, R., & Grego, J. (2012). Duolingo Effectiveness Study. Retrieved from https://en.wikipedia.org/wiki/Duolingo
62. Von Ahn, L. (n.d.). Human computation, 418-419. Retrieved from http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=5227025
63. Von Ahn, L. (2006). Games with a Purpose. Computer, 39(6), 92-94. doi:10.1109/MC.2006.196
64. Von Ahn, L., & Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 57. doi:10.1145/1378704.1378719
65. Welinder, P., & Perona, P. (2010). Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (pp. 25-32). IEEE. doi:10.1109/CVPRW.2010.5543189
66. WolframAlpha. (n.d.-a). Wolfram|Alpha: Computational Knowledge Engine. Retrieved September 27, 2013, from http://www.wolframalpha.com/
67. WolframAlpha. (n.d.-b). Wolframalpha.com Site Info. Alexa Internet. Retrieved September 27, 2013, from http://www.alexa.com/siteinfo/wolframalpha.com
68. WolframAlpha. (2013). How much web traffic does wolframalpha get - Wolfram|Alpha. Retrieved September 27, 2013, from http://www.wolframalpha.com/input/?i=how+much+web+traffic+does+wolframalpha+get
69. Games with Words. (n.d.). Games With Words. Retrieved September 27, 2013, from http://www.gameswithwords.org/VerbCorner/
70. Yeung, K. (2013). Clipped launches as a Flipboard competitor, using natural language processing to better curate news - The Next Web. The Next Web. Retrieved September 27, 2013, from http://thenextweb.com/apps/2013/01/01/clipped-launches-as-flipboard-competitor-to-help-curate-news/
71. Yoo, S.-S., Kim, H., Filandrianos, E., Taghados, S. J., & Park, S. (2013). Non-invasive brain-to-brain interface (BBI): establishing functional links between two brains. PLoS ONE, 8(4), e60410. doi:10.1371/journal.pone.0060410
72. Zotero. (n.d.). Zotero.