Human Computation AAAI Technical Report WS-12-08

Playful Surveys: Easing Challenges of Human Subject Research with Online Crowds

Markus Krause, Jan Smeddinck, Aneta Takhtamysheva, Velislav Markov, Nina Runge
University of Bremen, TZI, Bremen, Germany
phateon@tzi.de, smeddinck@tzi.de, aneta@tzi.de, velislavmarkov@yahoo.com, nr@tzi.de

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

A major challenge of human subject research lies in motivating enough subjects to participate in studies. Traditionally, participants are extrinsically motivated, for example by being paid for their contribution. Together with the effort of organizing and supervising experiments, this renders human subject research either very expensive or reduces its validity due to small sample sizes. This work describes the method of utilizing playful web-based surveys to intrinsically motivate contributors to participate in studies and illustrates the approach with two examples: a study of the effect of retouching portraits on the perception of human faces and corresponding estimates of wealth and success, which was distributed via a single announcement on a social network and attracted more than 2400 participants within a five-month period, and a study on the perception of a questionnaire in the form of a playful survey as compared to a more traditional online questionnaire, which showed that participants are more likely to recommend playful surveys to friends than normal surveys.

Keywords: human subject research, crowdsourcing, survey, motivation, social media

Introduction

A common problem of human subject research (HSR) is the acquisition of large numbers of participants for studies and meeting sampling requirements within that group. This applies to many research fields, such as social and political science, psychology, public health, market research, HCI, and others. Practitioners in these fields face the challenge of recruiting and scheduling participants for their studies in a timely and cost-effective manner. They have to pay for space, equipment, and assistants to advertise, schedule, and run their studies. This becomes even more demanding if the participants are not locals.

In recent years, it has become a reasonable alternative to carry out human subject research via web-based systems (Horton et al. 2010). Different attempts have been made to apply crowdsourcing (CS) to support empirical studies in software engineering (Stolee and Elbaum 2010) or user studies in general (Kittur et al. 2008) by employing the established crowdsourcing platforms (e.g. Amazon Mechanical Turk, CrowdFlower, or MicroTask) as tools for acquiring participants, structuring procedures, and presenting experiment interfaces. Schmidt (2010) describes a number of challenges for HSR on such platforms, which revolve around the fact that the established CS systems were not designed with HSR in mind. While current crowdsourcing platforms are limited in terms of their suitability for HSR, they are still promising means for large-scale and comparatively low-cost studies. They allow for fast and parallel access to large groups of participants that used to be out of reach for many researchers in the past.

In the light of these promises, a substantial number of research projects and businesses have turned to collecting input from web users. The tools that have been employed range from well-established online survey tools (e.g. SurveyMonkey, LimeSurvey, or uTest) to online experiments (Fogg et al. 2001) and remote usability testing (Andreasen et al. 2007). While the range of potential participants theoretically encompasses every Internet user, practitioners oftentimes still have to recruit subjects manually or end up with a small number of participants (Kittur et al. 2008). Due to increasingly frequent customer feedback requests, survey fatigue has a considerable and potentially biasing effect. Besides acquiring sufficient numbers of the right participants, detecting malicious behavior as well as legal and ethical considerations pose substantial challenges.
However, the central challenge for surveys remains motivation. By now, the power of computer and video games to intrinsically motivate people to participate in serious activities is well known. In this light, we propose a method for HSR with online crowds called playful surveys. Such surveys include playful elements that resemble simple game mechanics or storytelling in order to attract participants who are motivated to contribute because it is fun.

The two examples presented herein build upon a playful element that is similar to common personality quizzes in magazines. In such quizzes, the participant receives a certain score for each answer. By summing up these individual scores, the reader receives a final score that is connected to one of a number of different personality assessments. In an informal pre-study, only a few people considered these questionnaires to be serious, yet the participants still claimed to give truthful answers, at least while doing the questionnaire for the first time.

The first study examined the effect that digital retouching of portrait photographs has on the perception of the depicted human faces. In the light of the methodological focus of this paper, the results of the experiment are promising, especially in terms of participant numbers: the study was announced and published via a single Facebook account, attracting more than 2400 participants within a five-month trial period and without any further advertising. The second study aimed to compare the perception of playful surveys to the perception of more traditional electronic surveys.

Challenges of Human Subject Research with the Crowd

Focusing on the classic crowdsourcing approach with paid workers on existing platforms, Schmidt (2010) highlights that HSR tasks differ from typical CS tasks and are often difficult to implement with the templates and structures that the established platforms provide. The variety of research fields and topics that utilize HSR entails complex requirements that cannot be tackled with the batches of repetitive small tasks relayed to mostly anonymous workers that are the standard form of tasks on most CS systems. The common concerns of HSR can be classified into challenges related to the following areas (cf. Schmidt 2010).

(1) The participant pool, including considerations about participant demographics, motivation, past research experience, skills and knowledge, and pre-selection. HSR often requires careful participant selection regarding multiple parameters that are defined by the purpose of each study. Most commonly, filtering happens across age, gender, educational level, interests, skills, and other factors. Current crowdsourcing platforms only provide very limited demographic information (Alonso et al. 2008). It is also important for researchers to be aware of potential sampling bias, because CS workers are representative neither of their national nor of the global population (Paolacci et al. 2010, Ross et al. 2010). Since CS workers are typically motivated to provide their services for financial remuneration, they are interested in fulfilling short and simple tasks that give them the opportunity to accomplish as many tasks as possible (Mason and Watts 2009), which may impact the quality of responses in human subject studies. Lastly, human subject studies oftentimes require establishing enduring contact with the participants in order to carry out follow-up tests or run series of surveys. This requires access to the contact or profile information of selected groups of participants who have expressed interest in longer cooperation, which is difficult to achieve with current CS platforms.

(2) Detecting bad behavior: The anonymous nature of typical CS tasks creates space for potentially malicious or unethical behavior (Downs et al. 2010). Spamming and negligent work when quality is unverifiable have been some of the major issues of crowdsourcing and human computation platforms. Quality control mechanisms like majority vote (Ipeirotis et al. 2010) and gold questions can significantly increase the quality of the results (Kittur et al. 2008). However, more sophisticated mechanisms are needed to detect and prevent malicious behavior in HSR, where there frequently are no absolutely "correct" responses.

Amongst other aspects, the question of fair financial remuneration for the efforts of crowd workers raises (3) ethical and legal considerations. While underpayment is subject to employer ethics, higher rates create pressure to participate depending on the financial need of the participant. This, in turn, may result in sampling bias and uncommitted participation (Frei 2009, Horton et al. 2010).

(4) Study design tools and (5) study management tools are not yet readily available for the major CS platforms and cannot easily be implemented due to a lack of control over the environment in which CS tasks are deployed.

Finally, (6) managing participant expectations may be an issue due to the differences between typical CS tasks and typical HSR tasks. Additional clarification of the task should be provided to the workers. Still, participants' expectations regarding different types of tasks and regarding the goals of the task creator may increase the risk of bias and noise in the results.

Despite accurate accounts of the requirements of HSR practitioners and the related shortcomings of current CS systems, there is a lack of concrete approaches to systematically deal with the challenges stated above. The following section provides an account of a study that was deployed on a social network and employed playful elements, resulting in the concept of playful surveys as a strategy that can ease some of the issues with participant pools, detecting and preventing bad behavior, ethical and legal considerations, as well as managing participant expectations.
Playful Surveys on Social Networks

The playful questionnaire "Who's got it?" was part of a study that examined the effect of retouching on the perception of human faces by measuring corresponding estimates of wealth and success. The survey encompassed three runs per trial with each participant. For each run, a set of three random portraits of different models of the same sex and with different retouch levels was presented to the participants in the context of a playful social network application. The participants then had to rate how successful they thought the depicted persons were. This was done by setting three values with three identical sliders in the range from 0 (unsuccessful) to 10 (successful).

Figure 1: UI screenshot of the presented experiment. The three images are always of different retouch levels but of the same sex.

Figure 1 shows a screenshot of the playful user interface of the experiment. After all three runs were completed, a final "evaluation" screen was displayed, as shown in figure 2. The screen showed one of three different texts; the responses provided by the participants during the survey determine which text is shown. Each time a participant rates an image, the distance to the majority vote for that image is calculated. In the end, all differences are summed up, giving a final result that determines which text is shown: the closer the participant's answers are to the majority vote, the more positive the text. Receiving such a personality report adds a playful notion to participating in the survey.
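The logic of this evaluation step can be summarized in a brief sketch. The snippet below merely illustrates the described mechanism; the function names, the example texts, and the cut-off values that map the summed distance onto one of the three texts are assumptions rather than the original implementation.

# Illustrative sketch of the "Who's got it?" evaluation screen logic.
# Names, texts, and cut-off values are hypothetical.

def majority_vote(ratings):
    """Most frequent rating (0-10) that an image received from all participants."""
    return max(set(ratings), key=ratings.count)

def evaluation_text(participant_ratings, all_ratings, cutoffs=(5, 12)):
    """Pick one of three texts based on the summed distance to the majority votes."""
    total_distance = sum(
        abs(rating - majority_vote(all_ratings[image_id]))
        for image_id, rating in participant_ratings.items()
    )
    if total_distance <= cutoffs[0]:
        return "You have a great eye for success!"        # most positive text
    if total_distance <= cutoffs[1]:
        return "You are often right about people."        # middle text
    return "Success can be hard to spot - keep trying!"   # least positive text

# Example: the participant's ratings per image vs. the ratings of all other users
participant = {"img_01": 7, "img_02": 3, "img_03": 9}
crowd = {"img_01": [6, 7, 7, 8], "img_02": [2, 3, 3, 4], "img_03": [5, 5, 6, 6]}
print(evaluation_text(participant, crowd))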
Figure 2: Example of the end screen of "Who's got it?". The text is one of three different versions, depending on the final score.

In order to ensure data stability, only the first contribution of each participant was taken into account. In a lab-based pre-study with 10 participants, the first response was always described as the most truthful one.

The portion of the experiment reported herein was conducted over a five-month period. During this period, 2453 participants took part in the experiment, and 676 of those were considered in the final evaluation. A participant was ignored if he or she could not be uniquely identified, did not provide demographic information, or did not complete the survey. The final group of contributors provided 13485 judgments on the 150 images. In the end, as only the first contribution was taken into account, 6084 of these judgments were analyzed. These numbers indicate that intrinsic motivation through playful surveys is an alternative to paid crowdsourcing. While the demographics of the experiment indicate a wide range of participants, they are shifted towards the age and gender of the user who announced the survey on the social network (Facebook). Some of the demographics are depicted in figures 3 and 4.
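The filtering described above (unique identification, complete demographic information, a completed survey, and only the first contribution per participant) can be expressed compactly. The sketch below is an assumed reconstruction with hypothetical record fields, not the original processing code.

# Hypothetical sketch of the participant and judgment filtering described above.
from collections import OrderedDict

def filter_contributions(contributions):
    """Keep only the first complete contribution of each uniquely identified participant.

    Each contribution is assumed to be a dict with the keys 'user_id', 'age',
    'gender', 'completed' (bool), and 'judgments' (list of per-image ratings),
    ordered by submission time.
    """
    first_by_user = OrderedDict()
    for c in contributions:
        if not c.get("user_id"):                           # cannot be uniquely identified
            continue
        if c.get("age") is None or not c.get("gender"):    # missing demographic information
            continue
        if not c.get("completed"):                         # did not finish the survey
            continue
        first_by_user.setdefault(c["user_id"], c)          # keep only the first contribution
    judgments = [j for c in first_by_user.values() for j in c["judgments"]]
    return list(first_by_user.values()), judgments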
Figure 3: Distribution of gender within the 2453 participants. N/A indicates that no answer was given. The gender of the person who seeded the experiment on the network was male.

Figure 4: Age distribution of the 2453 participants. 14% did not reply to the question. The age group of the person who seeded the experiment on the network was 21-25.

Playful Surveys Compared to Classic Electronic Surveys

The second study was designed specifically to investigate the acceptance and perception, as well as some potential benefits, of playful surveys in comparison to more traditional digital survey tools. The playful questionnaire "Bake Your Personality" (Takhtamysheva et al. 2011) lets users answer six serious questions about their video gaming preferences. The questions are transported through the metaphorical concept of a cooking process, and answers are given by choosing ingredients for a final product (in this case a cake), as depicted in figure 5. At the end, the "cake" is submitted for tasting and the user receives a playful personality report based on the "taste" of the cake. This does not only add a playful element to the survey, but also serves to motivate users to be truthful with their answers in order to receive correct personality reports.

Figure 5: Screenshot of the "Bake Your Personality" application.

In order to evaluate the validity and reception of the playful questionnaire, a questionnaire with a generic visual style and an identical set of questions was developed (figure 6). After completing the standard questionnaire, users also receive a playful personality report that contains the same text as in the playful version.

Figure 6: Screenshot of the standard questionnaire.

The initial evaluation involved 20 students (10 males and 10 females), ranging from 20 to 35 years of age. Each subject completed both types of the survey in random order, each followed by a System Usability Scale (SUS) questionnaire. The participants then also completed a comparative feedback form about the two approaches.

The evaluation of the SUS questionnaire revealed a slightly better usability of the standard questionnaire, at M=79 out of 90 possible points (SD=7), compared to M=69 out of 90 possible points (SD=15.3) for the playful questionnaire. The comparative feedback showed that the users self-reportedly did not feel significantly more or less distracted by either version of the survey (Wilcoxon signed-rank test: Z=1.26, Mdn=4, p>.05, r=.21). Given the classification options "game", "playful personality report", "serious personality report", "survey", and "undecided", eight participants classified the playful survey as a "game" and the other twelve classified it as a "playful personality report". Replies to the question "Which application would you prefer to use in the future?" revealed no significant result. Responses to the question "Which of the applications would you recommend to a friend on a social network?" revealed significant differences (Z=-2.67, p<.01). The majority of participants were positive about spreading the playful questionnaire further to friends, while showing no intention of doing so regarding the normal digital questionnaire (cf. figure 7).
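Paired comparisons of this kind can be reproduced with standard statistics libraries. The snippet below is a generic sketch using SciPy with invented example ratings; it does not use the data of the study reported above.

# Generic sketch of a Wilcoxon signed-rank test on paired survey ratings.
# The rating values are invented for illustration; they are not the study data.
from scipy.stats import wilcoxon

# One pair of ratings per participant, e.g. perceived distraction on a 1-7 scale.
playful  = [4, 5, 3, 4, 6, 4, 5, 3, 4, 4, 5, 4, 3, 5, 4, 4, 6, 3, 4, 5]
standard = [4, 4, 3, 5, 5, 4, 4, 3, 4, 5, 4, 4, 3, 5, 4, 5, 5, 3, 4, 4]

statistic, p_value = wilcoxon(playful, standard)
print(f"W={statistic:.1f}, p={p_value:.3f}")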
Figure 7: Responses to the question "Which of the applications would you recommend to a friend on a social network?".

Discussion of the Studies

Considering the proposed method of playful surveys, the studies hint at both potential benefits and pitfalls. While "Who's got it?" managed to attract a large number of participants with little advertising effort, it took some effort in research and development, and the percentage of omitted responses needs to be reduced. A pre-study had indicated that questions presented in a playful format need to be of some interest to the participants, which was the case when asking users to guess the wealth of other people. The limits of this principle, however, require further research, since, for example, asking users to judge the packaging of a cleaning product as part of marketing research might be perceived as work, even when such questions are delivered as part of a playful survey.

The demographics depicted in figures 3 and 4 clearly show a sampling bias towards the demographic group of the seed. While such effects could likely be counterbalanced by using multiple seeds in a wide range of demographic groups, the bias might also be desired, e.g. in marketing research about products for specific target groups. Still, snowball sampling remains an important issue for surveys that are distributed over social networks, and the spread into a wider demographic reach over time, as well as the impact of the playful aspect on sampling effects, require further investigation.

The results of the initial comparative study between the playful "Bake Your Personality" survey and a normal version of the same survey indicate that users can clearly distinguish playful surveys from normal surveys and attribute playful aspects to them. At the same time, the respondents did not believe that the style of the application distracted them more or less from the task of completing a survey in either version. They were significantly more positive about recommending the playful questionnaire to friends on a social network than they were about the normal survey. This is encouraging, since the motivating potential, the playfulness, seems to be positively perceived.

While playful surveys could be considered a case of gamification, according to well-known literature on games (cf. e.g. Crawford 2003, Salen and Zimmerman 2004) the presented examples might not even be considered games, which is the reason to distinguish the approach by name, so as to avoid confusion and raising false expectations.

Implications on Challenges of HSR with the Crowd

The concept of playful surveys on social networks has an impact on different aspects of HSR with online crowds.

The Impact of the Playful Aspect on HSR with Online Crowds

As argued before, one challenge of HSR is the motivation of participants. It is not just a matter of motivating enough people to participate; depending on their level and type of motivation, participants may answer questions differently or pay more or less attention to a task. With the proposed method, the motivation is intrinsic, since the participants are interested in the outcome for themselves. This may yield advantages as opposed to the extrinsic motivation achieved by paying CS workers to contribute (Hsiu-Fen 2007).

Another problem for HSR in paid crowdsourcing is that workers who quit tasks are punished by not getting paid. As Schmidt (2010) points out, this practice may be irreconcilable with institutional review board (IRB) regulations. This is not as problematic with the approach presented in this work: quitting the survey is not punished in any way and is easily detectable. As no payment is involved, the risk of attracting spammers is relatively low. However, many participants of the final group (>50%) participated in the "Who's got it?" survey more than once. This form of playful behavior, testing for different results, must be expected and accounted for in playful environments.

The Impact of Social Network Propagation on HSR with Online Crowds

In paid crowdsourcing, a pre-selection of participants is often necessary in order to meet budget constraints, both in terms of time and money. Surveys that are intrinsically motivating, such as the proposed system, are not tied to this constraint, as no money is paid to participants. The case of "Who's got it?" illustrates how a playful survey that is seeded to only a few contacts can reach a large number of participants.

Another key issue for many human subject studies is the reliability of the demographic data of the participants. With 97-99% of social network users providing truthful information (Acquisti and Gross 2006, Young and Quan-Haase 2009), it is possible to acquire personal information that is much richer than what is available on current CS platforms. Groups of participants from large social networks are also potentially more representative of the general population than CS workers.

Drawn together, playful studies in conjunction with their deployment via social networks can help tackle some of the challenges in the areas of participant pools, detecting bad behavior, ethical and legal considerations, as well as managing participant expectations.

Discussion & Future Work

The method proposed above shows the potential to overcome some issues of crowd-sourced human subject research. However, a number of challenges remain, especially in the areas of ethical and legal considerations, study design tools, and study management tools. The lack of tools is not practically solved by introducing new methods as long as there is no template implementation, which leads to the most prominent challenge with the approach presented herein: the necessity to program. While the projects described above were carried out to a large extent by non-expert programmers, the required level of technical skill may be intimidating to practitioners from less technology-affine research disciplines. While this challenge may inspire new collaborations between research fields, it may well be down to the established CS platforms to eventually improve the situation by providing templates for more complex projects.

Furthermore, the method presented in this paper will only work for a certain range of research problems, as not every study can be as easily integrated into a playful system. While improving access to additional variables and controls that can be accessed through modern browsers will allow for more advanced setups, experiments that, for example, require physically present face-to-face interaction between participants or physical artifacts are not suitable for online research.

Concerning the ethical and legal considerations, data access and gathering must be transparent both to participants and to IRBs. Transparency can be achieved by excluding data from being used or stored, and since the process is technologically predetermined, privacy issues can theoretically be handled just as cleanly as in lab studies. This can, however, only be ensured to a certain extent and only with secured connections between the participants' clients and the servers. Encrypted data transfer was not used in the examples presented herein.

A further crucial issue in many studies is access control, for instance when follow-up trials are required. Such restrictions can be implemented by relying on user ids provided by social networks. In the case of follow-up trials, social networks also allow embedded applications to send messages to users, which can be employed for sending out reminders. Profile information allows researchers to check for certain skills and knowledge, as social networks provide this information to applications. This data allows for pre-selecting participants, which is required, for instance, in studies that need to match multiple subjects. The decision which data is used, which is stored, and whether it is anonymized is technically determined, which makes studies with online crowds transparent to institutional review boards.
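The kind of bookkeeping this implies can be illustrated with a short, generic sketch that gates follow-up access on pseudonymized network user ids. The identifier values, the wave structure, and the salt are assumptions for illustration; the sketch does not refer to any specific social network API or to the implementation used in the presented studies.

# Generic sketch of access control for follow-up waves, keyed on pseudonymized
# social network user ids. All names and values are hypothetical.
import hashlib

SALT = "study-specific-secret"           # assumed per-study salt

def pseudonym(user_id: str) -> str:
    """Store only a salted hash so that raw network ids never enter the study data."""
    return hashlib.sha256((SALT + user_id).encode("utf-8")).hexdigest()

class FollowUpRegistry:
    def __init__(self):
        self.completed = {}              # pseudonym -> set of completed wave numbers

    def record_completion(self, user_id: str, wave: int) -> None:
        self.completed.setdefault(pseudonym(user_id), set()).add(wave)

    def may_access(self, user_id: str, wave: int) -> bool:
        """A participant may enter wave n only once and only after finishing wave n-1."""
        done = self.completed.get(pseudonym(user_id), set())
        return (wave == 1 or wave - 1 in done) and wave not in done

registry = FollowUpRegistry()
registry.record_completion("network-user-42", wave=1)
print(registry.may_access("network-user-42", wave=2))   # True
print(registry.may_access("network-user-42", wave=1))   # False, already completed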
Besides these major challenges, some issues remain in the areas where improvements were achieved. Concerning participant pools, some limitations remain, such as detailed control over when and under which conditions a participant can re-access a task for a follow-up study, or directly controlling sampling bias. Social network relationship graphs are not only interesting for human subject research per se, but could also be employed to get a better understanding of issues related to participant pools. If the overall composition of the members of a social network group is known, a measure of the random sampling accuracy of participants based on demographic data can be established, with large implications for the feasible statistics of HSR results.

Detecting bad behavior remains difficult outside the context of controlled laboratory studies. Currently, the detection of spammers and malicious behavior on common CS platforms focuses on answer correctness based on majority vote (Ipeirotis et al. 2010). This method is often unsuitable for HSR, where there is frequently no single correct response to a task. Instead, intra-rater reliability tests that provide consistency values for each contributor, based on the consistency between the votes of the contributor over time and on gold data, could be employed.
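One way to obtain such a consistency value is sketched below: the repeated ratings a contributor gave to the same items (or to gold items with known reference values) are compared, and a low average deviation translates into a high per-contributor reliability score. The data structures and the scoring rule are illustrative assumptions, not the mechanism used in the studies presented here.

# Hypothetical sketch of an intra-rater consistency score for survey contributors.

def intra_rater_consistency(ratings_by_item, scale_max=10):
    """ratings_by_item maps an item id to the list of ratings one contributor gave
    to that item across repetitions; the result lies in [0, 1] (1 = fully consistent)."""
    deviations = []
    for item_ratings in ratings_by_item.values():
        if len(item_ratings) < 2:
            continue                          # item was not repeated, nothing to compare
        mean = sum(item_ratings) / len(item_ratings)
        deviations.extend(abs(r - mean) for r in item_ratings)
    if not deviations:
        return None                           # no repeated items for this contributor
    return 1.0 - (sum(deviations) / len(deviations)) / scale_max

# Example: one contributor rated three images twice over the course of the study.
contributor = {"img_07": [6, 7], "img_23": [2, 2], "img_88": [9, 4]}
print(round(intra_rater_consistency(contributor), 2))   # lower when repeats disagree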
In addition to these two larger areas, the challenge of CS workers participating in online research with problematic expectations is evaded. Lastly, while the problem of crowdworkers expecting more typical CS tasks than what they face as participants of HSR does not arise here, the impact of expectations that arise due to the promise and/or experience of playful elements is itself subject to future research, as is the development and testing of alternative, non-disruptive playful aspects.

Conclusion

Traditionally, human subject studies are complex and expensive processes in which participants are most often extrinsically motivated. New approaches are actively exploring the opportunities of employing crowdsourcing platforms for HSR. While the benefits of CS are dramatically reduced costs and instant access to large pools of potential participants, the mostly extrinsic motivation, the limited freedom of study presentation, and reliability still pose challenges. The two studies described above demonstrate that playful elements, coupled with deployment on social networks, can motivate potential participants and address some of the challenges in the areas of participant pools, detecting bad behavior, ethical and legal considerations, as well as managing participant expectations.

Acknowledgments

This work was partially funded by the Klaus Tschira Foundation through the graduate school "Advances in Digital Media".
References

Acquisti, A., and Gross, R. (2006). "Imagined communities: awareness, information sharing and privacy protection on the Facebook". In Proceedings of the 6th Workshop on Privacy Enhancing Technologies.

Alonso, O., Rose, D. E., and Stewart, B. (2008). "Crowdsourcing for relevance evaluation". ACM SIGIR Forum, 42, 9-15.

Andreasen, M. S., Nielsen, H. V., Schrøder, S. O., and Stage, J. (2007). "What happened to remote usability testing?: an empirical study of three methods". CHI '07 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Crawford, C. (2003). "Chris Crawford on Game Design". Berkeley, CA, USA: New Riders Publishing.

Downs, J. S., Holbrook, M. B., Sheng, S., and Cranor, L. F. (2010). "Are your participants gaming the system?: screening Mechanical Turk workers". Proceedings of the 28th International Conference on Human Factors in Computing Systems, 2399-2402.

Fogg, B., Marshall, J., Kameda, T., Solomon, J., Rangnekar, A., Boyd, J., and Brown, B. (2001). "Web credibility research: a method for online experiments and early study results". CHI '01 Extended Abstracts on Human Factors in Computing Systems, 295-296.

Frei, B. (2009). "Paid crowdsourcing: current state & progress toward mainstream business use". Produced by Smartsheet.com.

Horton, J. J., and Chilton, L. B. (2010). "The labor economics of paid crowdsourcing". Proceedings of the 11th ACM Conference on Electronic Commerce, 209-218.

Horton, J. J., and Zeckhauser, R. J. (2010). "The potential of online experiments". Working paper.

Hsiu-Fen, L. (2007). "Effects of extrinsic and intrinsic motivation on employee knowledge sharing intentions". Journal of Information Science, 33(2), 135-149.

Ipeirotis, P. G., Provost, F., and Wang, J. (2010). "Quality management on Amazon Mechanical Turk". HComp '10 Proceedings of the ACM SIGKDD Workshop on Human Computation.

Kittur, A., Chi, E. H., and Suh, B. (2008). "Crowdsourcing user studies with Mechanical Turk". CHI '08 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 453-456.

Mahajan, A. (2011). "Building Big Social Games". Presentation at the Game Developers Conference, GDC 2011.

Mason, W., and Watts, D. J. (2009). "Financial incentives and the performance of crowds". HComp '09 Proceedings of the ACM SIGKDD Workshop on Human Computation, 11(2), 100-108.

Nam, K. K., Ackerman, M. S., and Adamic, L. A. (2009). "Questions in, knowledge in?: a study of Naver's question answering community". Proceedings of the 27th International Conference on Human Factors in Computing Systems, 779-788.

Paolacci, G., Chandler, J., and Ipeirotis, P. G. (2010). "Running experiments on Amazon Mechanical Turk". Judgment and Decision Making, 5(5), 411-419.

Quinn, A. J., and Bederson, B. B. (2011). "Human computation: a survey and taxonomy of a growing field". CHI '11 Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems.

Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., and Tomlinson, B. (2010). "Who are the crowdworkers?: shifting demographics in Mechanical Turk". Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems, 2863-2872.

Salen, K., and Zimmerman, E. (2004). "Rules of Play". Cambridge, MA, USA: MIT Press.

Schmidt, L. A. (2010). "Crowdsourcing for human subjects research". CrowdConf '10 Proceedings of the 1st International Conference on Crowdsourcing.

Stolee, K. T., and Elbaum, S. (2010). "Exploring the use of crowdsourcing to support empirical studies in software engineering". Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, 35.

Takhtamysheva, A., Krause, M., and Smeddinck, J. (2011). "Serious questionnaires in playful social network applications". Proceedings of the 10th International Conference on Entertainment Computing (ICEC '11), 6243, 4.

Young, A. L., and Quan-Haase, A. (2009). "Information revelation and internet privacy concerns on social network sites: a case study of Facebook". In Proceedings of the Fourth International Conference on Communities and Technologies, 265-274.