Introduction
- CALVs
  - Main idea
  - Successful examples
  - Claimed benefits
  - Little study so far
- Wikipedia to set up questions/challenges
  - Expertise (answered in earlier paper)
  - What is quality? (not to be answered here)
  - Timing of checking (set up PEER vs. WIKI)
  - What about bad people and quality? (set up models)
  - Motivating people (set up labels)
  - Picking work (set up strategies)
- Our questions and contributions
  - How is this different from CHI 2005? Important!
  - Stress size, real-world aspects
  - Support for intelligently choosing tasks for people
  - Mathematical model that suggests no improvement for peer review, with empirical support
  - Design guidelines

Related work
- For some reason, I want this to be smaller than usual.
- Other studies of CALVs: discretionary databases, Slash(dot) and burn, Viegas et al., CHI 2005, MUDs/online games?
- Motivating contributors
  - Studies/theory supporting telling (or not telling) people why
  - Studies/theory supporting factors one might use in picking work
- CS inspiration: recommender systems, information filtering

Experimental framework
- MovieLens, movie editing as context
  - MovieLens basics
  - How movies are edited now
- Soliciting contribution interface
  - Basics, front-page lists – compare and contrast to recent changes on Wikipedia
  - Three manipulations, to be discussed below:
    - Movie-choosing algorithms
    - Labels or not
    - PEER vs. WIKI
- Metrics (a small computation sketch appears after the Expt 1 notes below)
  - # of users who edited at least one movie – spread of work among people
  - log(# of movies edited by a user + 1) – differences in work
  - log(flat/blank quality + 1) – effectiveness of work
- Survey
- Size, time, etc.
  - Overall: # of users, editors, movies shown/edited

Expt 1: How does choosing tasks affect contributions?
- Four algorithms for choosing tasks (a hedged routing sketch appears after this experiment's notes)
  - Why a community might want to assign tasks
  - Explain the algorithms briefly – how are they reasonable practically and theoretically?
- How the four strategies compare in the experiment
  - A summary of activity per strategy
  - For # of movies, RareRated wins overall and on the chosen list. NeedsWork is really bad.
  - For # of users who contribute, RareRated wins again – big time, overall (25% of users vs. 11%) and on the chosen list (21% vs. 6%).
  - For # of fields corrected, the per-movie score is highest for NeedsWork (makes sense), but RareRated again wins for total # of fields corrected.
- A closer look at RareRated
  - What the survey has to say
    - People generally didn't know what strategy was used, except for RareRated.
    - People claimed they prefer to edit movies that need more editing – contra the NeedsWork results (though it's also possible that people in that condition just didn't take the survey!).
    - People strongly prefer to edit movies they've rated, and also like to edit movies that fewer people have rated – but the preference for rated movies is stronger than the preference for rarely rated ones. (Responses on a 5-point agreement scale, agreement decreasing left to right; the last number is the mean.)
      - "I prefer to edit movies few people rate rather than movies many people rate": 8% (10), 28% (33), 55% (65), 7% (8), 3% (3); mean 2.67
      - "I prefer to edit movies I have rated rather than movies I haven't": 29% (35), 44% (52), 20% (24), 3% (4), 3% (4); mean 2.08
  - Complaints about movies not changing led to the realization that the strategies saw different numbers of movies. These are unique movies per strategy, and whether they were edited when first shown in the chosen list to a given user:

      Strategy   Unique movies shown   Edited when first shown   %
      0          4606                  33                        0.72%
      1          3647                  140                       3.84%
      2          8564                  34                        0.40%
      3          11261                 98                        0.87%

    This means that our results above are probably understated(!)
  - If we know the movie was rated, does rare matter? A logistic regression says that popularity gives us nothing within the RareRated sample, while being rated higher or needing more work gives more information; rating and needing work are the better predictors – especially rating.
  - Note that this ignores psychological effects (i.e., the effect of seeing a set of movies chosen "just for me").
- Beyond simple strategies: a logistic regression (a minimal fitting sketch appears after these notes)
  - Can compute a number of statistics in MovieLens, like popularity, predicted liking, similarity, rating, average rating.
  - Compare an LR built from these to an LR built from just strategy: the statistic-based models do better.
  - What factors are important? Rating, needing work.
- Mini-discussion
  - Do these results translate to other contexts? Yes – IR relevance, # of page views, etc. could stand in for preference, popularity, and so on.
  - Further, we could imagine using # of views to help estimate quality, decide when not to show a movie, etc.
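A minimal sketch of the three metrics from the Experimental framework section (spread of work, differences in work, effectiveness of work). The per-user data structure and field names are placeholders, and the quality value stands in for whatever flat/blank quality score the paper actually uses:

```python
import math

# Hypothetical per-user activity; structure and field names are illustrative only.
edits_by_user = {
    "u1": {"movies_edited": 4, "quality": 7.0},
    "u2": {"movies_edited": 0, "quality": 0.0},
    "u3": {"movies_edited": 12, "quality": 20.5},
}

# Metric 1 (spread of work): number of users who edited at least one movie.
num_editors = sum(1 for u in edits_by_user.values() if u["movies_edited"] > 0)

# Metric 2 (differences in work): log(# of movies edited by a user + 1).
work = {uid: math.log(u["movies_edited"] + 1) for uid, u in edits_by_user.items()}

# Metric 3 (effectiveness of work): log(quality + 1), for whatever quality score is used.
effectiveness = {uid: math.log(u["quality"] + 1) for uid, u in edits_by_user.items()}

print(num_editors, work, effectiveness)
```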
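The notes name only two of the four task-choosing strategies (RareRated and NeedsWork). As a hedged illustration of what such routing could look like (not the paper's implementation), here is a sketch of those two, with data structures and ranking rules inferred from the strategy names and the survey discussion:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Movie:
    movie_id: int
    num_ratings: int                  # how many members have rated this movie
    missing_fields: int               # rough proxy for how much editing it needs
    raters: Set[str] = field(default_factory=set)

def rare_rated(movies: List[Movie], user_id: str, k: int = 10) -> List[Movie]:
    """RareRated (assumed): movies this user has rated that few other people have rated."""
    candidates = [m for m in movies if user_id in m.raters]
    return sorted(candidates, key=lambda m: m.num_ratings)[:k]

def needs_work(movies: List[Movie], k: int = 10) -> List[Movie]:
    """NeedsWork (assumed): movies with the most missing or incomplete information."""
    return sorted(movies, key=lambda m: m.missing_fields, reverse=True)[:k]
```

The other two strategies aren't named in these notes, so they're omitted here.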
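For the "beyond simple strategies" logistic regressions, a minimal sketch of the comparison (synthetic data; the feature set, the one-hot encoding of strategy, and the use of scikit-learn are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: one row per (user, shown movie); outcome = whether it was edited.
rng = np.random.default_rng(0)
n = 1000
X_stats = rng.normal(size=(n, 5))        # e.g., popularity, predicted liking, similarity, rating, avg rating
strategy = rng.integers(0, 4, size=n)    # which of the four strategies showed the movie
X_strategy = np.eye(4)[strategy]         # one-hot encoding of strategy alone
y = rng.integers(0, 2, size=n)           # 1 if the movie was edited

# Compare an LR built from MovieLens statistics to an LR built from strategy only.
stats_auc = cross_val_score(LogisticRegression(max_iter=1000), X_stats, y,
                            scoring="roc_auc", cv=5).mean()
strategy_auc = cross_val_score(LogisticRegression(max_iter=1000), X_strategy, y,
                               scoring="roc_auc", cv=5).mean()
print(f"stats-based AUC: {stats_auc:.3f}  strategy-only AUC: {strategy_auc:.3f}")
```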
Expt 2: How does telling people why affect contributions?
- Part of me is wondering if we want to tell the labels story here, or if we might want to save it for a journal version. There's already going to be a lot of stuff flying around.
- Prior work around disclosing info
  - Salience is good, but…
  - Laying on specific motivations often seems to discourage people.
  - Our situation is not quite the same – helping people understand why we picked a movie for them isn't the same as instilling motivation.
- The labels
  - A label for each strategy
- How labels affected editing behavior (a sketch of these tests appears after the Expt 3 notes)
  - At first it was overwhelmingly "no labels better" (t-test on user counts).
  - But we feared this was because of the different tone of the labels, so we redesigned them.
  - During the second half, no difference (I think; need to double-check this analysis).
  - No interactions between labels and strategy (ANOVA with interaction effects).
- Mini-discussion
  - The problems with translating theory to design, including finding the right theories and making faithful transcriptions (fuller discussion in Bob's journal paper).

Expt 3: How does the timing of reviews affect contributions?
- Prior work
  - Hark back to the 2005 CHI paper: oversight vs. no oversight.
- The review systems
  - Pre-hoc vs. post-hoc review.
- Survey
  - People think others would prefer checking after (dummies!), and that checking after is faster.
  - People prefer checking before for themselves, and think it would help long-run quality and make MovieLens more valuable and useful to them.
  - For checking overall, people claimed that checking didn't really make them more or less confident, likely to edit, or careful when editing. (Was there a per-group difference?)
- Our very simple model (already written) – add a note about the difference between approve/deny and second-person edits.
- Using the model to reason about results (a hedged parameter-fitting sketch appears after these notes)
  - Basic results on # of edits and movies made live. Note "wasted" work.
  - Observe the quality difference between PEER and WIKI movies that went live: PEER's 2nd edit did more.
  - Graphs and comparison to the model – estimate Vmax, gamma, and beta from experimental data, and see how well the model fits.
- Mini-discussion
  - Is the model reasonable? In the large, yes.
  - Does it apply in many contexts? We think so.
  - How can designers use the model? For now, just in the large – checking doesn't provide much benefit by itself. However, thinking about how to alter gamma and beta may be fruitful. (It might turn out that checking increases gamma, which would increase overall quality.)
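A hedged sketch of the labels analyses mentioned in Expt 2 (the t-test on a per-user contribution measure and the labels x strategy ANOVA). The data are synthetic and the exact dependent measure and condition names are assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Synthetic per-user records: a contribution measure plus the two experimental conditions.
df = pd.DataFrame({
    "edits": np.log(rng.poisson(2, size=400) + 1),
    "labels": rng.choice(["labels", "no_labels"], size=400),
    "strategy": rng.choice(["RareRated", "NeedsWork", "S3", "S4"], size=400),
})

# t-test: users who saw labels vs. users who did not.
with_labels = df.loc[df["labels"] == "labels", "edits"]
without_labels = df.loc[df["labels"] == "no_labels", "edits"]
print(stats.ttest_ind(with_labels, without_labels))

# Two-way ANOVA with a labels x strategy interaction term.
model = smf.ols("edits ~ C(labels) * C(strategy)", data=df).fit()
print(anova_lm(model, typ=2))
```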
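The review-timing model itself is referenced as already written and is not reproduced in these notes; only its parameters (Vmax, gamma, beta) are named. Purely to illustrate the "estimate the parameters from experimental data and see how well the model fits" step, here is a sketch that fits a made-up three-parameter saturating-value curve with scipy; the functional form and the data below are assumptions, not the paper's model:

```python
import numpy as np
from scipy.optimize import curve_fit

def value(n_edits, v_max, gamma, beta):
    # Illustrative form only (not the paper's model): value approaches v_max as edits
    # accumulate at rate gamma; beta controls how much value is missing at zero edits.
    return v_max * (1.0 - beta * np.exp(-gamma * n_edits))

# Hypothetical observations: edits a movie received vs. its measured quality gain.
n_edits = np.array([0, 1, 2, 3, 4, 5, 6, 8, 10], dtype=float)
quality = np.array([0.0, 1.9, 3.1, 4.0, 4.5, 4.9, 5.2, 5.6, 5.8])

(v_max, gamma, beta), _ = curve_fit(value, n_edits, quality, p0=[6.0, 0.5, 1.0])
print(f"Vmax={v_max:.2f}, gamma={gamma:.2f}, beta={beta:.2f}")
```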
Discussion
- MovieLens member reactions
  - Limit-haters and script-man
- Methodological annoyances
  - The trouble with field experiments: designs that are both reasonable systems and useful experimentally are hard to find (brief story of the labels).
  - Warn about how the data collected are imperfect.
- Resummarize contributions
  - Soliciting work is effective (how much better is MovieLens now? Compare to guru).
  - Strategy matters (strategies that might benefit the community, like NeedsWork, are not best – must combine the community's and members' needs).
  - Checking qua checking provides no benefit (but checking might have a psychological benefit).

Future work
- Testing the model in other datasets and domains (hard, though).
- Extending to different kinds of work (experts vs. peers, perhaps).
- Using psych theory to reason about altering gamma and beta.

Conclusion
- The coming importance of CALVs; the need to get members on board; clever thinking about algorithms and interfaces can lead to success, value, and happy people.