Evaluation
Eyal Ophir
CS 376
4/28/09
Readings
• Methodology Matters (McGrath, 1994)
• Practical Guide to Controlled Experiments on the Web (Kohavi et al., 2007)
Methodology Matters
• Methods for Research in the Behavioral and Social Sciences
• Different methods have strengths and weaknesses
• Tradeoff between:
  • Generalizability
  • Precision
  • Realism
• Credibility requires consistency and convergence across methods
Study Design
• Find baserates, correlations, or differences
• Randomization of selection and assignment to conditions
• Statistical significance
• Validity (internal, statistical, construct, external)

Measures
• Self-report
• Trace measures
• Observation (by a visible or hidden observer)
• Archival records (public or private)

Manipulation
• Selection
• Direct intervention
• Induction (indirect intervention: confederates, deception)
Case Study: Multitasking UI
• Users play two simultaneous instantiations of a game
• Does making the two instantiations visually different make it easier to switch back and forth?
Case Study
• Tradeoffs: Generalizability, Precision, Realism
• Design: baserates, correlations, differences
• Random selection, assignment
• Validity: internal, statistical, construct, external
• Measures: self-report, trace measures, observation, archival records
• Manipulation: selection, intervention, induction
General Question
• Has social psychology resisted formal theory, and if so, why?
Practical Guide to Controlled Experiments on the Web
Web Experiments
• OEC: Overall Evaluation Criterion, the single quantitative measure the experiment is run to optimize
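A minimal sketch of a composite OEC; the metric names and weights here are invented for illustration, not taken from the paper:

    def oec(user):
        """Hypothetical composite OEC: weights short-term revenue against
        a rough proxy for long-term engagement."""
        return 0.7 * user["revenue_per_visit"] + 0.3 * user["visits_per_week"]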
• Hypothesis testing and sample size (see the sketch after this list)
• Confidence, power
• Reducing the standard error:
  • Sufficiently large sample size
  • OEC with inherently low variability
  • Reduce variability by excluding irrelevant cases
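To make sample size concrete: Kohavi et al. cite the rule of thumb n = 16σ²/Δ² users per variant (roughly 95% confidence and 80% power), where σ² is the OEC's variance and Δ is the smallest effect worth detecting. A minimal sketch for a binary OEC; the example numbers are invented:

    import math

    def users_per_variant(baseline_rate, min_detectable_delta):
        """n = 16 * sigma^2 / delta^2 rule of thumb (~95% confidence, ~80% power).
        For a binary OEC such as conversion rate, sigma^2 = p * (1 - p)."""
        sigma_sq = baseline_rate * (1 - baseline_rate)
        return math.ceil(16 * sigma_sq / min_detectable_delta ** 2)

    # Detecting a 5% relative change on a 2% conversion rate:
    print(users_per_variant(0.02, 0.02 * 0.05))  # 313600 users per variant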
Extensions for Online Experiments
• Treatment ramp-up (see the sketch after this list)
• Automation
• Software migration
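A minimal sketch of ramp-up logic; the staged schedule and the guardrail check are assumptions for illustration, not prescribed by the paper:

    # Hypothetical staged ramp-up: expose a growing fraction of users to the
    # treatment, and abort if guardrail metrics degrade.
    RAMP_STAGES = [0.001, 0.01, 0.05, 0.20, 0.50]

    def treatment_fraction(stage, guardrails_ok):
        """Fraction of traffic to route to the treatment; aborting
        sends everyone back to control."""
        if not guardrails_ok:
            return 0.0
        return RAMP_STAGES[min(stage, len(RAMP_STAGES) - 1)]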
Limitations of web experiments
• No explanation of mechanism
• Focus on short-term effects
• Primacy/newness effects
• Must implement treatments
Implementation
• Randomization (see the sketch after this list):
  • Pseudorandom with caching
  • Hash and partition
• Assignment:
  • Traffic splitting
  • Server-side
  • Client-side
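A minimal sketch of the hash-and-partition approach (the function and hashing details are assumptions, not code from the paper). Hashing the user id together with the experiment name keeps assignment deterministic for a returning user, independent across experiments, and stateless on the server:

    import hashlib

    def assign_variant(user_id, experiment, treatment_fraction=0.5):
        """Hash-and-partition assignment: bucket the user into one of
        1000 equal-width partitions and split by treatment_fraction."""
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 1000
        return "treatment" if bucket < treatment_fraction * 1000 else "control"

The treatment_fraction knob also supports ramp-up: raising it from 0.001 toward 0.5 grows the treatment group without reassigning users already in treatment.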
Lessons Learned (i.e., tips for the researcher)

Analysis
• Mine the data
• Time matters
• Multi-factor experiments (see the sketch after this list)
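A minimal sketch of multi-factor assignment, with invented factor names: randomizing each factor independently yields a full factorial design, so main effects and interactions can both be estimated from a single experiment:

    import hashlib

    FACTORS = {"button_color": ["blue", "green"], "layout": ["grid", "list"]}

    def assign_cell(user_id):
        """Hash each factor independently so every combination of levels
        occurs with equal probability (a full factorial design)."""
        cell = {}
        for factor, levels in FACTORS.items():
            digest = hashlib.md5(f"{factor}:{user_id}".encode()).hexdigest()
            cell[factor] = levels[int(digest, 16) % len(levels)]
        return cell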
Trust and Execution
• Run A/A tests (test your system; see the sketch after this list)
• Ramp-up and abort
• Use the correct sample size
• Assign 50% of traffic to treatment
• Beware day-of-week effects
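A minimal simulation of the A/A idea, assuming a binary OEC with a 2% base rate and a t-test (both illustrative choices): the two groups receive identical experiences, so roughly 5% of runs should reach p < 0.05, and a much higher rate points to a bug in assignment or logging:

    import numpy as np
    from scipy import stats

    def aa_false_positive_rate(runs=1000, users=10_000):
        """Repeatedly split users into two identical 'A' groups and test
        for a difference; a healthy system flags ~5% of runs at p < 0.05."""
        rng = np.random.default_rng(0)
        hits = 0
        for _ in range(runs):
            a = rng.binomial(1, 0.02, users)  # same true rate in both groups
            b = rng.binomial(1, 0.02, users)
            hits += stats.ttest_ind(a, b).pvalue < 0.05
        return hits / runs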
Culture and Business
• Agree on the OEC upfront
• Beware “harmless” features
• Weigh performance vs. maintenance cost
• Data-driven (vs. opinion-driven) culture
Extended Case Study
• Assume the game UI from the first case study was an actual gaming site
• The website is interested in promoting multiple simultaneous games between users, but users complain that it’s difficult to manage multiple games
• Design a web-based study informed by the reading to test the new design
Case Study
• OEC
• Sample size, reducing error
• Ramp-up, automation
• Mechanism explanation, short- vs. long-term effects, primacy/newness
• Randomization/assignment
• Mine the data, multi-factor experiments
• A/A tests, sample size, day-of-week effects
Data-Oriented Culture
• Pros? Cons?
• How can we best use user tests to inform design and innovation?
• Trade-offs of experimentation vs. intuition
• Why the OEC? What are good measures for non-commerce sites?
• Do online tests maximize all of McGrath’s parameters?