paper172-outline

Introduction
CALVs
Main idea
Successful examples
Claimed benefits
Little study
Wikipedia to set up questions/challenges
Expertise (answered in earlier paper)
What is quality (not to be answered here)
Timing of checking (set up PEER vs WIKI)
What about bad people and quality? (set up models)
Motivating people (set up labels)
Picking work (set up strategies)
Our questions and contributions
How is this different from CHI 2005? Important!
Stress size, real world aspects
Support for intelligently choosing tasks for people
Mathematical model that suggests no improvement for peer, with empirical support.
Design guidelines.
Related work
For some reason, I want this to be smaller than usual.
Other studies of CALVs
Discretionary databases, Slash(dot) and burn, Viégas et al., CHI 2005, MUDs/online games?
Motivating contributors
Studies/theory supporting telling (or not telling) people why
Studies/theory supporting factors one might use in picking work
CS inspiration
Recommender systems, information filtering
Experimental framework
MovieLens, movie editing as context
MovieLens basics
How movies are edited now
Soliciting contribution interface
Basics, front page lists -- compare and contrast to recent changes on WPedia
Three manipulations, to be discussed below.
Movie choosing algorithms
Labels or not
PEER vs WIKI
Metrics
# of users who edited at least one movie – spread of work among people
log(# of movies edited by a user + 1) – differences in work
log(Flat, blank quality + 1) – effectiveness of work
Survey
Size, time, etc.
Overall, # users, editors, movies shown/edited
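To make the metrics concrete, a minimal sketch of computing them from a hypothetical edit log (the column names user_id, movie_id, and fields_corrected are assumptions, and fields corrected stands in for the quality score; this is not the actual data schema):

import numpy as np
import pandas as pd

# Hypothetical edit log; the column names are assumptions, not the real schema.
edits = pd.DataFrame({
    "user_id":          [1, 1, 2, 3, 3, 3],
    "movie_id":         [10, 11, 10, 12, 13, 14],
    "fields_corrected": [2, 0, 1, 3, 1, 0],
})
edited = edits[edits["fields_corrected"] > 0]            # rows where an edit happened

# Spread of work: # of users who edited at least one movie.
n_editors = edited["user_id"].nunique()

# Differences in work: log(# of movies edited by a user + 1).
log_movies = np.log(edited.groupby("user_id")["movie_id"].nunique() + 1)

# Effectiveness of work: log(quality + 1), with fields corrected standing in for quality.
log_quality = np.log(edited.groupby("movie_id")["fields_corrected"].sum() + 1)

print(n_editors, log_movies.mean(), log_quality.mean())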
Expt 1: How does choosing tasks affect contributions?
Four algorithms for choosing tasks
Why a community might want to assign tasks
Explain algos briefly – how are they reasonable practically and theoretically
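As a concrete (but hedged) illustration, here is one way two of the named strategies might pick candidate movies, assuming hypothetical ratings and movie tables; the actual definitions, and the other two strategies, may differ:

import pandas as pd

# Illustrative readings of two of the four strategies; details are assumptions.

def rare_rated(user_id, ratings, n=10):
    """Movies this user has rated that few other members have rated."""
    mine = ratings.loc[ratings["user_id"] == user_id, "movie_id"]
    popularity = ratings["movie_id"].value_counts()      # # of raters per movie
    return popularity[popularity.index.isin(mine)].nsmallest(n).index.tolist()

def needs_work(movies, n=10):
    """Movies with the most blank/missing fields, regardless of the user."""
    return movies.nlargest(n, "blank_fields")["movie_id"].tolist()

# Toy data with hypothetical column names, just to show the call shape.
ratings = pd.DataFrame({"user_id": [1, 1, 2, 2, 3], "movie_id": [10, 11, 10, 12, 10]})
movies = pd.DataFrame({"movie_id": [10, 11, 12], "blank_fields": [0, 3, 1]})
print(rare_rated(1, ratings, n=2))   # [11, 10]: user 1's rated movies, rarest first
print(needs_work(movies, n=2))       # [11, 12]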
How the 4 strategies compare in the expt
A summary of activity per strategy
For # movies, RareRated wins overall and on the chosen list. NeedsWork is really bad.
For # users who contribute, RareRated wins again, big time, both overall (25% of users vs. 11%) and on the chosen list (21% vs. 6%).
For # fields corrected, the per-movie score is highest for NeedsWork (which makes sense), but RareRated again wins on the total # of fields corrected.
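The outline does not name the test behind the contributor-rate comparison; a two-proportion z-test is one plausible check. A sketch with placeholder group sizes (only the 25% vs. 11% rates come from the text above):

from statsmodels.stats.proportion import proportions_ztest

# Placeholder per-condition user counts; the real ns come from the experiment.
n_rare, n_other = 400, 400
k_rare, k_other = int(0.25 * n_rare), int(0.11 * n_other)   # reported contributor rates

z, p = proportions_ztest([k_rare, k_other], [n_rare, n_other])
print(f"z = {z:.2f}, p = {p:.4f}")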
A closer look at RareRated
What the survey has to say
People generally didn’t know what strategy was used, except for RareRated.
People claimed they prefer to edit movies that need more editing – contra the NeedsWork results (although it's also possible that people in that condition just didn't take the survey!)
People strongly prefer to edit movies they've rated, and also like to edit movies that fewer people have rated – but the having-rated preference is stronger than the few-raters preference:
Responses run from strongly agree (left) to strongly disagree (right); mean on a 1-5 scale where 1 = strongly agree.
"I prefer to edit movies few people rate rather than movies many people rate":
8% (10) / 28% (33) / 55% (65) / 7% (8) / 3% (3); mean 2.67
"I prefer to edit movies I have rated rather than movies I haven't":
29% (35) / 44% (52) / 20% (24) / 3% (4) / 3% (4); mean 2.08
Complaints about movies not changing led to the realization that the strategies showed different numbers of movies. The table below counts unique movies per strategy, and whether each was edited when first shown in the chosen list to a given user.
Strategy | Unique movies shown | Edited when first shown | % edited
0        | 4606                | 33                      | 0.72%
1        | 3647                | 140                     | 3.84%
2        | 8564                | 34                      | 0.40%
3        | 11261               | 98                      | 0.87%
This means that our results above are probably understated(!)
If we know the movie was rated, does rare matter?
LR says popularity gives us nothing in the RareRated sample – being rated higher and needing more work are the better predictors, especially the rating.
Note that this ignores psychological effects (i.e., the effect of seeing a set of movies chosen “just for me”).
Beyond simple strategies: a logistic regression
We can compute a number of statistics in MovieLens, like popularity, predicted liking, similarity, the user's rating, and average rating. Comparing an LR built from these features to an LR built from strategy alone, the feature-based models do better. The important factors: rating and needing work.
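A sketch of the kind of regression described here, run on synthetic data with assumed feature names (popularity, user_rated, blank_fields); the paper's actual feature set and observations differ:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the per-(user, movie) display log; feature names are assumptions.
shown = pd.DataFrame({
    "popularity":   rng.integers(1, 500, n),   # # of ratings the movie has
    "user_rated":   rng.integers(0, 2, n),     # did this user rate it?
    "blank_fields": rng.integers(0, 6, n),     # how much work it needs
})
# Outcome roughly driven by having rated the movie and by missing fields.
true_logit = -3 + 1.5 * shown["user_rated"] + 0.4 * shown["blank_fields"]
shown["edited"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(shown[["popularity", "user_rated", "blank_fields"]])
fit = sm.Logit(shown["edited"], X).fit(disp=False)
print(fit.params)   # which factors carry the weight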
Mini-discussion
Do these results translate to other contexts? Yes – in IR, for example, relevance and # of page views could be stand-ins for preference, popularity, etc.
Further, could imagine using # views to help estimate quality, decide when not to show a movie, etc.
Expt 2: How does telling people why affect contributions?
Part of me is wondering if we want to tell the labels story here, or if we might want to save it for a journal version. There's already going to be a lot of stuff flying around.
Prior work around disclosing info
Salience is good, but…
Laying on specific motivations often seems to discourage people
Our stuff not quite the same – helping people understand why we picked a movie for them isn't the same as instilling motivation.
The labels
Label for each strategy
How labels affected editing behavior.
At first it was overwhelmingly “no labels better” (t-test on user counts)
But, we feared this was because of the different tone of the labels.
Redesigned labels.
During the second half, no difference (I think; need to double-check this analysis).
No interactions between labels and strategy (ANOVA w/ interaction effects)
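A sketch of the two analyses named above (a two-sample t-test on per-user activity and an ANOVA with a labels x strategy interaction), on synthetic data; the column names and the "Other" strategy names are placeholders:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 200

# Synthetic per-user activity; the real analysis uses the experiment's counts.
users = pd.DataFrame({
    "log_edits": rng.normal(1.0, 0.5, n),                 # e.g. log(# movies edited + 1)
    "labels":    rng.choice(["label", "no_label"], n),
    "strategy":  rng.choice(["RareRated", "NeedsWork", "OtherA", "OtherB"], n),
})

# Labels vs. no labels: two-sample t-test on per-user activity.
with_labels = users.loc[users["labels"] == "label", "log_edits"]
without = users.loc[users["labels"] == "no_label", "log_edits"]
print(stats.ttest_ind(with_labels, without))

# Two-way ANOVA with a labels x strategy interaction term.
model = ols("log_edits ~ C(labels) * C(strategy)", data=users).fit()
print(sm.stats.anova_lm(model, typ=2))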
Mini-discussion
The problems with translating theory to design, including finding the right theories and making faithful transcriptions (fuller discussion in Bob's journal paper)
Expt 3: How does the timing of reviews affect contributions?
Prior work
Hark back to the 2005 CHI paper: oversight vs. no oversight
The review systems
Pre-hoc vs. post-hoc
Survey
People think others would prefer checking after (dummies!), and that checking after is faster
People prefer checking before for themselves, and think it would improve long-run quality and make ML more valuable and useful to themselves.
For checking overall, people claimed that checking didn't really make them more or less confident, likely to edit, or careful when editing. (Was there a per-group difference?)
Our very simple model
(already written) – add note about the difference between approve/deny and 2nd person edits.
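One hedged illustration of what such a model could look like (an assumption for exposition; the already-written model may take a different form): suppose each effective edit closes a fraction gamma of the remaining gap to a maximum value Vmax, and a fraction beta of submitted edits is wasted (rejected or lost in review). Then after n submitted edits,

\[ V_n = V_{\max} - (V_{\max} - V_0)\,(1 - \gamma)^{(1 - \beta)\,n} \]

Under this form both PEER and WIKI approach the same ceiling Vmax; review that only changes beta slows convergence without raising the ceiling, while raising gamma (better individual edits) does improve quality, which matches the claim that checking by itself yields little improvement.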
Using the model to reason about results
Basic results on # of edits, movies made live. Note “wasted” work.
Observe quality difference between peer and wiki movies that went live: peer's 2nd edit did more.
Graphs and comparison to model – estimate Vmax, gamma, beta from experimental data, and see how well the model fits.
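A sketch of what the fitting step might look like, using the illustrative saturating form above and placeholder observations (not the experiment's data); beta could be estimated separately, e.g. as the observed fraction of wasted or rejected edits:

import numpy as np
from scipy.optimize import curve_fit

# Illustrative fit of the saturating form sketched above; observations are placeholders.
def value_after_edits(n, v_max, gamma, v0=0.0):
    return v_max - (v_max - v0) * (1 - gamma) ** n

n_edits = np.array([0, 1, 2, 3, 4, 5])
quality = np.array([0.0, 0.9, 1.5, 1.9, 2.1, 2.2])   # placeholder quality measurements

(v_max_hat, gamma_hat), _ = curve_fit(value_after_edits, n_edits, quality, p0=[2.5, 0.4])
print(v_max_hat, gamma_hat)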
Mini-discussion
Is the model reasonable? In the large, yes. Does it apply in many contexts? We think so.
How can designers use the model? For now, just in the large – checking doesn't provide much benefit by itself. However, thinking about how to alter gamma and beta may be fruitful. (It might turn out that checking increases gamma, which would increase overall quality.)
Discussion
ML member reactions
Limit-haters and script-man
Methodological annoyances
Trouble with field experiments:
Designs that are reasonable systems and useful experimentally are hard: brief story of labels
Warn about how data collected are imperfect.
Resummarize contributions
Soliciting work effective (how much better is ML now? Compare to guru)
Strategy matters (strategies that might benefit the community, like NeedsWork, aren't best – we must combine the community's and members' needs)
Checking qua checking: no benefit (but checking might have a psychological benefit).
Future work
Testing model in other datasets, domains. (Hard, though.)
Extending to different kinds of work (experts vs. peers, perhaps).
Using psych theory to reason about altering gamma, beta.
Conclusion
Coming importance of CALVs, need to get members on board, clever thinking about algorithms and interfaces can lead to success, value, and happy people.