Human Computation
AAAI Technical Report WS-12-08
Playful Surveys:
Easing Challenges of Human Subject Research with Online Crowds
Markus Krause, Jan Smeddinck, Aneta Takhtamysheva, Velislav Markov, Nina Runge
University of Bremen, TZI, Bremen, Germany
phateon@tzi.de, smeddinck@tzi.de, aneta@tzi.de, velislavmarkov@yahoo.com, nr@tzi.de
Abstract
A major challenge of human subject research lies in motivating enough subjects to participate in studies. Traditionally, participants are extrinsically motivated, for example by getting paid for their contribution. Together with the effort of organizing and supervising experiments, this renders human subject research either very expensive or reduces its validity due to small sample sizes. This work describes the method of utilizing playful web-based surveys to intrinsically motivate contributors to participate in studies and illustrates the approach with two examples: a study of the effect of retouching portraits on the perception of human faces and on corresponding estimates of wealth and success, which was distributed via a single announcement on a social network and attracted more than 2400 participants within a five-month period, and a study on the perception of a questionnaire in the form of a playful survey as compared to a more traditional online questionnaire, which showed that participants are more likely to recommend playful surveys to friends than normal surveys.
Keywords: human subject research, crowdsourcing,
survey, motivation, social media
Introduction
A common problem of human subject research (HSR) is the acquisition of large numbers of participants for studies and meeting sampling requirements within that group. This applies to many research fields, such as social and political science, psychology, public health, market research, HCI, and others. Practitioners in these fields face the challenge of recruiting and scheduling participants for their studies in a timely and cost-effective manner. They have to pay for space, equipment, and assistants to advertise, schedule, and run their studies. This becomes even more demanding if the participants are not local.
In recent years, it has become a reasonable alternative to carry out human subject research via web-based systems (Horton et al. 2010). Different attempts have been made to apply crowdsourcing (CS) to support empirical studies in software engineering (Stolee et al. 2010) or user studies in general (Kittur et al. 2008) by employing established crowdsourcing platforms (e.g., Amazon Mechanical Turk, CrowdFlower, or MicroTask) as tools for acquiring participants, structuring procedures, and presenting experiment interfaces. Schmidt (2010) describes a number of challenges for HSR on such platforms, which revolve around the fact that the established CS systems were not designed with HSR in mind. While current crowdsourcing platforms are limited in terms of their suitability for HSR, they are still promising means for large-scale and comparatively low-cost studies. They allow for fast and parallel access to large groups of participants that used to be out of reach for many researchers in the past.

In the light of these promises, a substantial number of research projects and businesses have turned to collecting input from web users. The tools that have been employed range from well-established online survey tools (e.g., SurveyMonkey, LimeSurvey, or uTest) to online experiments (Fogg et al. 2001) and remote usability testing (Andreasen et al. 2007). While the range of potential participants theoretically encompasses every Internet user, practitioners oftentimes still have to manually recruit subjects or end up with a small number of participants (Kittur et al. 2008). Due to increasingly frequent customer feedback requests, survey fatigue has a considerable and potentially biasing effect. Besides acquiring sufficient numbers of the right participants, detecting malicious behavior as well as legal and ethical considerations also pose substantial challenges.

However, the central challenge for surveys remains motivation. By now, the power of computer and video games to intrinsically motivate people to participate in serious activities is well known. In this light, we propose a method for HSR with online crowds called playful surveys.
Such surveys include playful elements that resemble simple game mechanics or storytelling in order to attract participants who are motivated to contribute because it is fun. The two examples presented herein build upon a playful element that is similar to common personality quizzes in magazines. In such quizzes, the participant receives a certain score for each answer. By summing up these individual scores, the reader obtains a final score that is connected to one of a number of different personality assessments. In an informal pre-study, only a few people considered these questionnaires to be serious, yet the participants still claimed to give truthful answers, at least when completing the questionnaire for the first time.
The first study examined the effect that digital retouching of portrait photographs has on the perception of the depicted human faces. In light of the methodological focus of this paper, the results of the experiment are especially promising in terms of participant numbers. The study was announced and published via a single Facebook account, attracting more than 2400 participants within a five-month trial period and without any further advertising. The second study aimed to compare the perception of playful surveys to the perception of more traditional electronic surveys.
Challenges of Human Subject Research with
the Crowd
Focusing on the classic crowdsourcing approach with paid workers on existing platforms, Schmidt (2010) highlights that HSR tasks differ from typical CS tasks and are often difficult to implement with the templates and structures that the established platforms provide. The variety of research fields and topics that utilize HSR entails complex requirements that cannot be tackled with the batches of repetitive small tasks relayed to mostly anonymous workers that are the standard form of tasks on most CS systems.

The common concerns of HSR can be classified into challenges related to (cf. Schmidt 2010):

(1) The participant pool, including considerations about participant demographics, motivation, past research experience, skills and knowledge, and pre-selection. HSR often requires careful participant selection regarding multiple parameters that are defined by the purpose of each study. Most commonly, filtering happens by age, gender, educational level, interests, skills, and other factors. Current crowdsourcing platforms only provide very limited demographic information (Alonso et al. 2008). It is also important for researchers to be aware of potential sampling bias, because CS workers are representative of neither their national nor the global population (Paolacci et al. 2010, Ross et al. 2010). Since CS workers are typically motivated to provide their services for financial remuneration, they are interested in fulfilling short and simple tasks that give them the opportunity to accomplish as many tasks as possible (Mason et al. 2009), which may impact the quality of responses in human subject studies. Lastly, human subject studies oftentimes require establishing enduring contact with the participants to carry out follow-up tests or run series of surveys. This requires access to the contact or profile information of selected groups of participants who expressed interest in longer cooperation, which is difficult to achieve with current CS platforms.

(2) Detecting bad behavior: The anonymous nature of typical CS tasks creates space for potentially malicious or unethical behavior (Downs et al. 2010). Spamming and negligent work when quality is unverifiable have been some of the major issues of crowdsourcing and human computation platforms. Quality control mechanisms like majority vote (Ipeirotis et al. 2010) and gold questions can significantly increase the quality of the results (Kittur et al. 2008). However, more sophisticated mechanisms are needed to detect and prevent malicious behavior in HSR, where there frequently are no absolute “correct” responses.
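To make the quality-control idea concrete, the following Python sketch (not taken from the studies; worker IDs, thresholds, and record layout are hypothetical) aggregates redundant answers by majority vote and flags contributors who miss too many gold questions:

```python
from collections import Counter, defaultdict

def majority_vote(labels_per_item):
    """Aggregate redundant labels per item by simple majority."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_per_item.items()}

def flag_workers(answers, gold, max_error_rate=0.3):
    """Flag workers whose error rate on gold questions exceeds a (hypothetical) threshold."""
    errors, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in answers:  # answers: (worker_id, item_id, label) triples
        if item in gold:
            seen[worker] += 1
            if label != gold[item]:
                errors[worker] += 1
    return {w for w in seen if errors[w] / seen[w] > max_error_rate}

# Invented example: "g1" is a gold question with known answer "A".
answers = [("w1", "g1", "A"), ("w2", "g1", "B"), ("w3", "g1", "A"),
           ("w1", "q7", "yes"), ("w2", "q7", "no"), ("w3", "q7", "yes")]
gold = {"g1": "A"}

per_item = defaultdict(list)
for _, item, label in answers:
    per_item[item].append(label)

print(majority_vote(per_item))      # {'g1': 'A', 'q7': 'yes'}
print(flag_workers(answers, gold))  # {'w2'}
```

For HSR tasks without a single correct answer, such gold-based filtering only works for the subset of items where a reference answer exists, which is why the more elaborate mechanisms mentioned above are needed.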
Amongst other aspects, the question of fair financial remuneration for the efforts of crowd workers raises (3) ethical and legal considerations. While underpayment is subject to employer ethics, higher rates create pressure to participate depending on the financial need of the participant. This, in turn, may result in sampling bias and uncommitted participation (Frei 2009, Horton et al. 2010).

(4) Study design tools and (5) study management tools are not yet readily available for the major CS platforms and cannot easily be implemented due to a lack of control over the environment in which CS tasks are deployed.
Finally, (6) managing participant expectations may be an issue due to the differences between typical CS tasks and typical HSR tasks. Additional clarification of the task should be provided to the workers. Still, mismatches between participants' expectations of typical tasks and the goals of the task creator may increase the risk of bias and noise in the results.
Despite accurate accounts of the requirements of HSR practitioners and the related shortcomings of current CS systems, there is a lack of concrete approaches for systematically dealing with the challenges stated above. The following section provides an account of a study that was deployed on a social network and employed playful elements, resulting in the concept of playful surveys as a strategy that can ease some of the issues with participant pools, detecting and preventing bad behavior, ethical and legal considerations, as well as managing participant expectations.
"(
!#(
!"(
'(
!(
#
#(
$!#
(
(
#(
(
!(
#(
Playful Surveys on Social Networks
!$"#
"$##
(
The playful questionnaire “Who's got it?” was part of a study that examined the effect of retouching on the perception of human faces by measuring corresponding estimates of wealth and success. The survey encompassed three runs per trial with each participant. For each run, a set of three random portraits of different models of the same sex and with different retouch levels was presented to the participants in the context of a playful social network application. The participants then had to rate how successful they thought the depicted persons were. This was done by setting three different values with three identical sliders in the range from 0 (unsuccessful) to 10 (successful).
Figure 1: UI screenshot of the presented experiment. The three different images are always of different retouch levels but of the same sex.

Figure 1 shows a screenshot of the playful user interface of the experiment. After all three runs were completed, a final “evaluation” screen was displayed, as shown in figure 2. The screen showed one of three different texts. The responses provided by the participants during the survey determine which text is shown. Each time a participant rates an image, the distance to the majority vote for that image is calculated. In the end, all differences are summed up, giving a final result that determines which text is shown. The closer the participant's answers are to the majority vote, the more positive the text will be. Receiving such a personality report adds a playful notion to participating in the survey.

Figure 2: Example of the end screen of “Who's got it?”. The text is one of three different versions, depending on the final score.
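As an illustration of the end-screen logic described above, the following sketch sums the distances between a participant's slider ratings and the per-image majority ratings and maps the total to one of the three texts; the cut-off values and texts are hypothetical, since the paper only states that closer answers yield a more positive text.

```python
def pick_end_text(participant_ratings, majority_ratings, texts):
    """Sum the distances between a participant's slider ratings (0-10) and the
    per-image majority ratings; a smaller total distance yields a more positive text."""
    total_distance = sum(abs(participant_ratings[img] - majority_ratings[img])
                         for img in participant_ratings)
    if total_distance <= 10:    # hypothetical cut-offs
        return texts[0]         # most positive text
    if total_distance <= 25:
        return texts[1]
    return texts[2]             # least positive text

majority = {"img_a": 7, "img_b": 3, "img_c": 5}
ratings  = {"img_a": 6, "img_b": 4, "img_c": 5}   # total distance = 2
texts = ["Very close to the crowd!", "Not bad at all.", "Looks deceive you easily."]
print(pick_end_text(ratings, majority, texts))    # -> "Very close to the crowd!"
```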
In order to ensure data stability, only the first contribution
of each participant was taken into account. In a lab-based
pre-study with 10 participants, the first response was always described as most truthful.
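A minimal sketch of this filtering step, assuming a hypothetical record layout with a participant ID and a timestamp per submitted trial, could look as follows:

```python
def first_contributions(trials):
    """Keep only each participant's earliest trial, assuming every record carries a
    participant ID and a timestamp (hypothetical layout)."""
    kept, seen = [], set()
    for trial in sorted(trials, key=lambda t: t["timestamp"]):
        if trial["participant_id"] not in seen:
            seen.add(trial["participant_id"])
            kept.append(trial)
    return kept

trials = [{"participant_id": 1, "timestamp": 10, "ratings": {"img_a": 6}},
          {"participant_id": 1, "timestamp": 42, "ratings": {"img_a": 9}},  # repeat play, dropped
          {"participant_id": 2, "timestamp": 15, "ratings": {"img_a": 3}}]
print(first_contributions(trials))
```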
Figure 3: Distribution of gender within the 2453 participants.
N/A indicates that no answer was given. The gender of the person
who seeded the experiment on the network was male.
The portion of the experiment reported herein was conducted over a five-month period. During this period, 2453 participants took part in the experiment, and 676 of those were considered in the final evaluation. A participant was ignored if he or she could not be uniquely identified, did not provide demographic information, or did not complete the survey. The final group of contributors provided 13485 judgments on the 150 images. In the end, as only the first contribution was taken into account, 6084 of these judgments were analyzed. These numbers indicate that intrinsic motivation through playful surveys is an alternative to paid crowdsourcing. While the demographics of the experiment indicate a wide range of participants, they are shifted towards the age and gender of the user who announced the survey on the social network (Facebook). Some of the demographics are depicted in figures 3 and 4.

Figure 4: Age distribution for the 2453 participants. 14% did not reply to the question. The age group of the person who seeded the experiment on the network was 21-25.
Playful Surveys Compared to Classic Electronic Surveys
The second study was designed specifically to investigate the acceptance and perception, as well as some potential benefits, of playful surveys in comparison to more traditional digital survey tools. The playful questionnaire “Bake Your Personality” (Takhtamysheva et al. 2011) lets users answer six serious questions about their video gaming preferences. The questions are conveyed through the metaphorical concept of a cooking process, and answers are given by choosing ingredients for a final product (in this case a cake), as depicted in figure 5. At the end, the “cake” is submitted for tasting and the user receives a playful personality report based on the “taste” of the cake. This not only adds a playful element to the survey, but also serves to motivate users to be truthful with their answers in order to receive correct personality reports.
Figure 5: Screenshot of the “Bake Your Personality” application.

In order to evaluate the validity and reception of the playful questionnaire, a questionnaire with a generic visual style and an identical set of questions was developed (figure 6). After completing the standard questionnaire, users also receive a playful personality report that contains the same text as in the playful version.
Figure 6: Screenshot of the standard questionnaire.
The initial evaluation involved 20 students (10 male and 10 female), ranging from 20 to 35 years of age. Each subject completed both types of the survey in random order, each followed by a System Usability Scale (SUS) questionnaire. The participants then also completed a comparative feedback form about the two approaches.
The evaluation of the SUS questionnaire revealed a slightly better usability score for the standard questionnaire, at M=79 out of 90 possible points (SD=7), compared to M=69 out of 90 possible points (SD=15.3) for the playful questionnaire. The comparative feedback showed that the users self-reportedly did not feel significantly more or less distracted by either version of the survey (Wilcoxon signed-rank test: Z=1.26, Mdn=4, p>.05, r=.21). Given the classification options “game”, “playful personality report”, “serious personality report”, “survey”, and “undecided”, eight participants classified the playful survey as a “game” and the other twelve classified it as a “playful personality report”. Replies to the question “Which application would you prefer to use in the future?” revealed no significant result.
Responses to the question “Which of the applications would you recommend to a friend on a social network?” revealed significant differences (Z=-2.67, p<.01). The majority of participants were positive about spreading the playful questionnaire further to friends, while showing no intention of doing so for the normal digital questionnaire (cf. figure 7).

Figure 7: Responses to the question: “Which of the applications would you recommend to a friend on a social network?”
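For readers who want to reproduce this kind of paired comparison, a Wilcoxon signed-rank test can be run with SciPy as sketched below; the ratings are invented example data, not the study data.

```python
from scipy.stats import wilcoxon

# Invented example: 20 participants rate "would recommend to a friend" on a 1-5 scale
# for the playful and the standard questionnaire (paired observations per participant).
playful  = [5, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4, 5, 3, 4, 5, 4, 4, 5]
standard = [3, 3, 2, 4, 2, 3, 3, 2, 4, 3, 2, 3, 3, 2, 3, 3, 2, 3, 4, 2]

statistic, p_value = wilcoxon(playful, standard)
print(f"W={statistic:.1f}, p={p_value:.4f}")
```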
Discussion of the Studies
Considering the proposed method of playful surveys, the studies hint at both potential benefits and pitfalls. While “Who's got it?” managed to attract a large number of participants with little advertising effort, it took some effort in research and development, and the percentage of omitted responses needs to be reduced. A pre-study had indicated that questions presented in a playful format need to be of some interest to the participants, which was the case when asking users to guess the wealth of other people. The limits of this principle, however, require further research, since, for example, asking users to judge the packaging of a cleaning product as part of marketing research might be perceived as work, even when such questions are delivered as part of a playful survey. The demographics depicted in figures 3 and 4 clearly show a sampling bias towards the demographic group of the seed. While such effects could likely be counterbalanced by using multiple seeds in a wide range of demographic groups, the bias might also be desired, e.g. in marketing research about products for specific target groups. Still, snowball sampling remains an important issue for surveys that are distributed over social networks, and the spread into a wider demographic reach over time, as well as the impact of the playful aspect on sampling effects, require further investigation.

The results of the initial comparative study between the playful “Bake Your Personality” survey and a normal version of the same survey indicate that users can clearly distinguish playful surveys from normal surveys and attribute playful aspects to them. At the same time, the respondents did not believe that the style of the application distracted them more or less from the task of completing a survey in either version. They were significantly more positive about recommending the playful questionnaire to friends on a social network than they were about the normal survey. This is encouraging, since the motivating potential, the playfulness, seems to be positively perceived.

While playful surveys could be considered a case of gamification, according to well-known literature on games (cf. e.g. Crawford 2003, Salen et al. 2004) the presented examples might not even be considered games, which is the reason to distinguish the approach by name, so as to avoid confusion and raising false expectations.
"
Implications on Challenges of HSR with the Crowd

The concept of playful surveys on social networks has an impact on different aspects of HSR with online crowds.

The Impact of Social Network Propagation on HSR with Online Crowds

In paid crowdsourcing, a pre-selection of participants is often necessary in order to meet budget constraints, both in terms of time and money. Surveys that are intrinsically motivating, such as the proposed system, are not tied to this constraint, as no money is paid to participants. The case of “Who's got it?” illustrates how a playful survey that is seeded to a few contacts can reach a large number of participants.

Another key issue for many human subject studies is the reliability of the demographic data of the participants. With 97-99% of social network users providing truthful information (Acquisti et al. 2006, Young et al. 2009), it is possible to acquire much richer personal information than what is possible with current CS platforms. Groups of participants from large social networks are also potentially more representative of the general population than CS workers.

The Impact of the Playful Aspect on HSR with Online Crowds
As argued before, one challenge of HSR is the motivation of participants. It is not just a matter of motivating enough people to participate; depending on their level and type of motivation, participants may answer questions differently or pay more or less attention to a task. With the proposed method, the motivation is intrinsic, since the participants are interested in the outcome for themselves. This may yield advantages as opposed to the extrinsic motivation achieved by paying CS workers to contribute (Hsiu-Fen 2007). Another problem for HSR in paid crowdsourcing is that workers who quit tasks are punished by not getting paid. As Schmidt (2010) points out, this behavior may be irreconcilable with institutional review board (IRB) regulations. This is not as problematic with the approach presented in this work: quitting the survey is not punished in any way and is easily detectable. As no payment is involved, the risk of attracting spammers is relatively low. However, many participants of the final group (>50%) participated in the “Who's got it?” survey more than once. This form of playful behavior, testing for different results, must be expected and accounted for in playful environments.
In addition to these two larger areas, the challenge of CS workers participating in online research with problematic expectations is evaded. Drawn together, playful studies in conjunction with their deployment via social networks can help tackle some of the challenges in the areas of participant pools, detecting bad behavior, ethical and legal considerations, as well as managing participant expectations.
Discussion & Future Work

The method proposed above shows the potential to overcome some issues of crowd-sourced human subject research. However, a number of challenges remain, especially in the areas of ethical and legal considerations, study design tools, and study management tools.

The lack of tools is not practically solved by introducing new methods as long as there is no template implementation, which leads to the most prominent challenge of the approach presented herein: the necessity to program. While the projects described above were carried out to a large extent by non-expert programmers, the required level of technical skill may be intimidating to practitioners from less technology-affine research disciplines. While this challenge may inspire new collaborations between research fields, it may well be down to the established CS platforms to eventually improve the situation by providing templates for more complex projects.

Furthermore, this paper provides a method that will only work for a certain range of research problems, as not every study can be as easily integrated into a playful system. While improving access to additional variables and controls that can be accessed through modern browsers will allow for more advanced setups, experiments that require physically present face-to-face interaction between participants, or physical artifacts, for example, are not suitable for online research. Concerning the ethical and legal considerations, data access and gathering must be transparent both to participants and IRBs. Transparency can be achieved by excluding data from being used or stored, and since the process is technologically predetermined, privacy issues can theoretically be handled just as cleanly as in lab studies. This can, however, only be ensured to a certain extent and only with secured connections between the participant's client and the servers. Encrypted data transfer was not used in the examples presented herein.
A further crucial issue in many studies is access control, for instance when follow-up trials are required. Such restrictions can be implemented by relying on the user IDs provided by social networks. In the case of follow-up trials, social networks also allow embedded applications to send messages to users, which can be employed for sending out reminders. Profile information allows researchers to check for certain skills and knowledge, as social networks provide this information to applications. This data allows for pre-selecting participants, which is required, for instance, in studies that need to match multiple subjects. The decision which data is used, which is stored, and whether it is anonymized is technically determined, which makes studies with online crowds transparent to institutional review boards.
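A sketch of such profile-based pre-selection is shown below; the record layout and selection criteria are hypothetical, and no actual social network API calls are made.

```python
def preselect(candidates, min_age=18, required_language="de", needed=30):
    """Select up to `needed` candidates that match hypothetical demographic criteria,
    keyed by the user ID the social network assigns to the application."""
    selected = []
    for user in candidates:
        if user.get("age", 0) >= min_age and required_language in user.get("languages", []):
            selected.append(user["user_id"])
        if len(selected) == needed:
            break
    return selected

candidates = [
    {"user_id": "u1001", "age": 24, "languages": ["de", "en"]},
    {"user_id": "u1002", "age": 17, "languages": ["de"]},   # too young, skipped
    {"user_id": "u1003", "age": 31, "languages": ["en"]},   # language mismatch, skipped
]
follow_up_pool = preselect(candidates, needed=2)
print(follow_up_pool)  # ['u1001'] -- stored user IDs can later be used to send reminders
```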
Besides these major challenges, some issues remain in the areas where improvements were achieved. Concerning participant pools, some limitations remain, such as detailed control over when and under which conditions a participant can re-access a task for a follow-up study, or directly controlling sampling bias.
Social network relationship graphs are not only interesting for human subject research per se, but could also be employed to get a better understanding of issues related to participant pools. If the overall composition of the members of a social network group is known, a measure of the random sampling accuracy of participants can be established based on demographic data, with significant implications for the statistics that are feasible on HSR results.
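Assuming the composition of the underlying group is known, one simple instance of such a measure is a goodness-of-fit test of the sample's demographic counts against the expected proportions; the counts and shares below are invented.

```python
from scipy.stats import chisquare

# Invented example: observed age-group counts in the sample vs. the known share of
# each age group within the social network group the sample was drawn from.
observed     = [90, 240, 130, 60]          # e.g. <21, 21-25, 26-35, >35
known_shares = [0.20, 0.35, 0.30, 0.15]
expected     = [sum(observed) * share for share in known_shares]

statistic, p_value = chisquare(observed, f_exp=expected)
print(f"chi2={statistic:.1f}, p={p_value:.4f}")  # a small p suggests the sample deviates from the group
```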
Detecting bad behavior remains difficult outside the context of controlled laboratory studies. Currently, the detection of spammers and malicious behavior on common CS platforms focuses on answer correctness based on majority vote (Ipeirotis et al. 2010). This method is often unsuitable for HSR when there is not a single correct response to tasks. Instead, intra-rater reliability tests that provide consistency values for each contributor, based on consistency between the votes of the contributor over time and gold data, could be employed. Lastly, while problems with crowd workers expecting more typical CS tasks than what they are facing as participants of HSR are not an issue, the impact of expectations that arise due to the promise and/or experience of playful elements is itself subject to future research, as is the development and testing of alternative non-disruptive playful aspects.
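One possible form of such an intra-rater consistency value, sketched here with invented numbers rather than the studies' data, compares a contributor's repeated ratings of the same items:

```python
def consistency_score(first_pass, second_pass, scale_max=10):
    """Return a value in [0, 1]: 1 means identical ratings for repeated items,
    0 means maximal disagreement on the given rating scale."""
    shared = set(first_pass) & set(second_pass)
    if not shared:
        return None  # the contributor never rated an item twice
    mean_abs_diff = sum(abs(first_pass[i] - second_pass[i]) for i in shared) / len(shared)
    return 1.0 - mean_abs_diff / scale_max

# Invented example: one contributor rated two images twice, e.g. in separate sessions.
first  = {"img_a": 7, "img_b": 2}
second = {"img_a": 6, "img_b": 3}
print(consistency_score(first, second))  # 0.9 -- low values could flag inconsistent contributors
```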
Discussion & Future Work
The method proposed above shows the potential to overcome some issues of crowd-sourced human subject research. However, a number of challenges remain, especially in the areas of ethical and legal considerations, study
design tools and study management tools.
The lack of tools is not practically solved by introducing
new methods, as long as there is no template implementation, which leads to the most prominent challenge with the
approach presented herein: The necessity to program.
While the projects described above were carried out to a
large extend by non-expert programmers, the required level
of technical skills may be intimidating to practitioners from
less technology affine research disciplines. While this challenge may inspire new collaborations between research
fields it may well be down to the established CS platforms
to eventually improve the situation by providing templates
for more complex projects.
Conclusion
Traditionally, human subject studies are complex and expensive processes in which participants are most often extrinsically motivated. New approaches are actively exploring the opportunities of employing crowdsourcing platforms for HSR. While the benefits of CS are dramatically reduced costs and instant access to large pools of potential participants, the mostly extrinsic motivation, limited freedom of study presentation, and reliability still pose challenges. The two studies described above demonstrate that playful elements, coupled with deployment on social networks, can motivate potential participants and address some of the challenges in the areas of participant pools, detecting bad behavior, ethical and legal considerations, as well as managing participant expectations.
Acknowledgments
This work was partially funded by the Klaus Tschira
Foundation through the graduate school “Advances in Digital Media”.
References
Acquisti, A., and Gross, R. (2006). “Imagined communities: awareness, information sharing and privacy protection on the Facebook”, Proceedings of the 6th Workshop on Privacy Enhancing Technologies.

Alonso, O., Rose, D. E., and Stewart, B. (2008). “Crowdsourcing for relevance evaluation”, ACM SIGIR Forum, 42, 9-15.

Andreasen, M. S., Nielsen, H. V., Schrøder, S. O., and Stage, J. (2007). “What happened to remote usability testing?: an empirical study of three methods”, CHI '07 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Crawford, C. (2003). Chris Crawford on Game Design. Berkeley, CA, USA: New Riders Publishing.

Downs, J. S., Holbrook, M. B., Sheng, S., and Cranor, L. F. (2010). “Are your participants gaming the system?: screening mechanical turk workers”, Proceedings of the 28th International Conference on Human Factors in Computing Systems, 2399-2402.

Fogg, B., Marshall, J., Kameda, T., Solomon, J., Rangnekar, A., Boyd, J., and Brown, B. (2001). “Web credibility research: a method for online experiments and early study results”, CHI '01 Extended Abstracts on Human Factors in Computing Systems, 295-296.

Frei, B. (2009). “Paid crowdsourcing: Current state & progress toward mainstream business use”, Produced by Smartsheet.com.

Horton, J. J., and Chilton, L. B. (2010). “The labor economics of paid crowdsourcing”, Proceedings of the 11th ACM Conference on Electronic Commerce, 209-218.

Horton, J. J., and Zeckhauser, R. J. (2010). “The potential of online experiments”, Working Paper.

Hsiu-Fen, L. (2007). “Effects of extrinsic and intrinsic motivation on employee knowledge sharing intentions”, Journal of Information Science, 33(2), 135-149.

Ipeirotis, P. G., Provost, F., and Wang, J. (2010). “Quality management on Amazon Mechanical Turk”, HComp '10 Proceedings of the ACM SIGKDD Workshop on Human Computation.

Kittur, A., Chi, E. H., and Suh, B. (2008). “Crowdsourcing user studies with Mechanical Turk”, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08), 453-456.

Mahajan, A. (2011). “Building Big Social Games”, Presentation at the Game Developers Conference, GDC 2011.

Mason, W., and Watts, D. J. (2009). “Financial incentives and the performance of crowds”, HComp '09 Proceedings of the ACM SIGKDD Workshop on Human Computation, 11(2), 100-108.

Nam, K. K., Ackerman, M. S., and Adamic, L. A. (2009). “Questions in, knowledge in?: a study of Naver's question answering community”, Proceedings of the 27th International Conference on Human Factors in Computing Systems, 779-788.

Paolacci, G., Chandler, J., and Ipeirotis, P. G. (2010). “Running experiments on Amazon Mechanical Turk”, Judgment and Decision Making, 5(5), 411-419.

Quinn, A. J., and Bederson, B. B. (2011). “Human computation: a survey and taxonomy of a growing field”, CHI '11 Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems.

Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., and Tomlinson, B. (2010). “Who are the crowdworkers?: shifting demographics in Mechanical Turk”, Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems, 2863-2872.

Salen, K., and Zimmerman, E. (2004). Rules of Play. Cambridge, MA, USA: MIT Press.

Schmidt, L. A. (2010). “Crowdsourcing for human subjects research”, CrowdConf '10 Proceedings of the 1st International Conference on Crowdsourcing.

Stolee, K. T., and Elbaum, S. (2010). “Exploring the use of crowdsourcing to support empirical studies in software engineering”, Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, 35.

Takhtamysheva, A., Krause, M., and Smeddinck, J. (2011). “Serious questionnaires in playful social network applications”, Proceedings of the 10th International Conference on Entertainment Computing (ICEC '11), 6243, 4.

Young, A. L., and Quan-Haase, A. (2009). “Information revelation and internet privacy concerns on social network sites: a case study of Facebook”, Proceedings of the Fourth International Conference on Communities and Technologies, 265-274.