02-hypothesis-lifecycle

Statistical Methods in Computer Science
Hypothesis Life-cycle
Ido Dagan
© 2006-now Gal Kaminka / Ido Dagan. Portions © Carla Ellis at Duke University.
Why experiment?
• W. Tichy, “Should Computer Scientists Experiment More?” (on course web page)
• System/model/theory testing
  – Identify incorrectness or incompleteness in your “theory”/assumptions
    • This can save money and lives!
  – E.g. underlying assumptions that are violated by reality
  – Can lead to revising the model and/or the system
• Exploration
  – Find new phenomena
  – E.g. unknown user behaviors when using systems
Empirical Research Cycle
• Established methodology, with a very long tradition
  – Natural sciences, social sciences
• The cycle:
  – Form a theory/model
  – Hypothesize based on the theory
    • E.g. for a search engine ranking function: more relevant pages are ranked higher than less relevant ones
  – Experiment (when possible)
    • E.g. ask people to judge relevance (binary, score, relative, …); see the sketch below
  – Observe results
  – Find discrepancies between hypothesized predictions and results
  – Revise the theory (and publish results)
• This course covers especially [hypothesis … discrepancy]
  – Heavy use of statistics and analytical skills (a bit of art)
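As a minimal sketch (not part of the original slides), the ranking hypothesis could be checked like this: given binary relevance judgments for a ranked result list, count the fraction of relevant/non-relevant page pairs that the ranking orders correctly. The judgment data and the function name are made up for illustration.

```python
# Sketch: are relevant pages ranked above non-relevant ones?
# The judgments list is made-up illustration data, in rank order.

def fraction_concordant(relevance_by_rank):
    """Fraction of (relevant, non-relevant) page pairs in which the
    relevant page appears above the non-relevant one."""
    concordant = total = 0
    for i, rel_higher in enumerate(relevance_by_rank):
        for rel_lower in relevance_by_rank[i + 1:]:
            if rel_higher != rel_lower:        # one relevant, one not
                total += 1
                if rel_higher > rel_lower:     # relevant page ranked higher
                    concordant += 1
    return concordant / total if total else float("nan")

# Binary judgments for the top results (1 = relevant), in rank order.
judgments = [1, 1, 0, 1, 0, 0, 1, 0]
print(f"correctly ordered pairs: {fraction_concordant(judgments):.2f}")  # 0.75
```

A fraction near 1 is consistent with the hypothesis; a fraction near 0.5 is exactly the kind of discrepancy that sends us back to revise the theory.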
Common Practice
• Vague idea
• No preliminary investigation
• No articulation of precise hypothesis
• Bad experimental design
• No iterations
Lots of Ways to Attack Experimentation
• Not general – applies only to the “system/setting under test”
  – E.g. general claims on user behavior that are true only for one system
• Not forward-looking
  – Motivations and observations based on the past
• Lack of representative comparison
  – Inadequate benchmarks (“users are happy with my system…”)
  – Difficult/costly to implement comparisons
• Not enabling independent replication of experiments
• Real data can be messy – difficult to choose which data to gather
  – E.g. which aspects of user behavior (speed, satisfaction, success, …)
Experimental Lifecycle
[Cycle diagram: Vague idea (“groping around”, experiences) → Initial observations → Model/Theory → Hypothesis → Experiment → Data, analysis, interpretation → Results & final presentation → back to the start]
Highlighted step: 1. Understand the problem, frame the questions, articulate the goals. A problem well-stated is half-solved.
A Systematic Approach
1. Understand the problem, frame the questions, articulate the goals.
   A problem well-stated is half-solved.
   • Be able to answer “why” as well as “what”
     • E.g. why do people search? To find a website? / To find information?
2. Select metrics that will help answer the questions (see the sketch below).
   • Rank of the correct website / Percentage of relevant pages in the top 10
3. Identify the parameters that affect behavior
   • System parameters (e.g., HW configuration, search speed)
   • Workload parameters (e.g., user request patterns)
   • Data parameters (e.g., long/short documents)
4. Decide which parameters to study (vary in the experiment)
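The two example metrics from step 2 can be made concrete as below; this is a sketch on made-up data, and the helper names are ours, not the course's.

```python
# Sketch of the two example metrics on made-up data.

def rank_of_correct(ranked_urls, correct_url):
    """1-based rank of the correct website, or None if it was not retrieved."""
    try:
        return ranked_urls.index(correct_url) + 1
    except ValueError:
        return None

def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant (binary judgments)."""
    return sum(relevance[:k]) / k

print(rank_of_correct(["b.com", "a.com", "c.com"], "a.com"))   # -> 2
print(precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 1]))          # -> 0.5
```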
Experimental Lifecycle
[Cycle diagram repeated: Vague idea (“groping around”, experiences) → Initial observations → Model/Theory → Hypothesis → Experiment → Data, analysis, interpretation → Results & final presentation]
Highlighted steps: 2. Select metrics that will help answer the questions. 3. Identify the parameters that affect behavior.
Behavior Parameters/Variables
Example: software performance
Hardware parameters
  – CPU model and organization, cache organization, latencies in the system (these will affect running time)
System parameters
  – Memory availability, usage
  – CPU running time (sometimes approximated by wall-clock time); see the sketch below
  – Communication bandwidth, usage
  – Program characteristics
    • Requires floating-point, heavy disk usage, integer math, graphics
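The difference between CPU running time and its wall-clock approximation is easy to see directly; the sketch below times an arbitrary stand-in workload both ways with Python's standard timers.

```python
# Sketch: CPU running time vs. its wall-clock approximation for one run.
import time

def workload(n=500_000):
    """Arbitrary integer-math stand-in for the program under test."""
    return sum(i * i for i in range(n))

wall_start = time.perf_counter()    # wall-clock timer
cpu_start = time.process_time()     # CPU time of this process only
workload()
print(f"wall-clock: {time.perf_counter() - wall_start:.4f} s")
print(f"CPU time:   {time.process_time() - cpu_start:.4f} s")
```

On a loaded machine the two can diverge noticeably, which is why the approximation deserves scrutiny.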
Now build a model (theory)
• Mathematically precise (a fit against measurements is sketched below)
  – Memory = 2*sizeof(input) + 3
  – Runtime = 500 + 30*sizeof(input) + 20
• Asymptotically correct
  – Memory = O(sizeof(input)) in the worst case
  – Runtime = O(log(sizeof(input))) in the best case
  – Accuracy is proportional to run-time
• Qualitative
  – User performance increases with reduced cognitive load
  – The number of bugs discovered is monotonically decreasing if the same programmer is used; otherwise it increases
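One way a mathematically precise model could be confronted with data is a least-squares fit, as sketched below for the memory model; the measurements are invented for illustration.

```python
# Sketch: confront the precise model Memory = 2*sizeof(input) + 3 with
# (invented) measurements via a least-squares line fit.
import numpy as np

sizes = np.array([10, 20, 40, 80, 160])      # sizeof(input)
memory = np.array([23, 44, 82, 165, 322])    # measured memory (made up)

slope, intercept = np.polyfit(sizes, memory, 1)
print(f"fitted: memory = {slope:.2f}*size + {intercept:.2f}")
# A fit far from 2*size + 3 is a discrepancy: revise the model.
```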
Now form a hypothesis
• Translate the qualitative into the quantitative
• Use of the new system will (these are different hypotheses):
  – Increase operator accuracy (compared to not using it) by X
  – Decrease failures by Y
  – Decrease performance time by Z
• Introducing link information into the relevance score will increase ranking quality by 10% (see the sketch below)
• ......
• Operationalize the hypothesis
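Operationalizing the link-information hypothesis might look like the sketch below: pick a concrete quality metric (precision@10 is assumed here, not prescribed by the slide), measure it with and without link information, and compare the relative change to the claimed 10%. The judgment lists are made up.

```python
# Sketch: operationalize "link information increases ranking quality by 10%".
# precision@10 is an assumed metric; the judgment lists are made up.

def precision_at_k(relevance, k=10):
    return sum(relevance[:k]) / k

without_links = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
with_links    = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]

p0 = precision_at_k(without_links)
p1 = precision_at_k(with_links)
print(f"relative improvement: {(p1 - p0) / p0:.0%}")  # vs. the claimed +10%
```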
What can go wrong at this stage?
• Wrong metrics (they don’t address the questions at hand)
  – E.g., ad click-through rather than purchases
• Bad metrics: too difficult to measure, too costly
• Overlooking significant parameters that affect the system
• Not being clear about where the “system under test” boundaries are
  – E.g. poor ad content rather than poor ad matching
• Unrepresentative test setting
  – Not predictive of real usage
  – Just what everyone else uses (adopted blindly)
  – NOT what anyone else uses (no comparison possible)
Experimental Lifecycle
[Cycle diagram repeated: Vague idea (“groping around”, experiences) → Initial observations → Model/Theory → Hypothesis → Experiment → Data, analysis, interpretation → Results & final presentation]
Highlighted steps: 1. Decide which parameters to vary. 2. Select technique. 3. Select measurements.
A Systematic Approach
1. Decide which parameters to study (vary)
2. Select a measurement technique:
   • Can we directly measure what we want?
   • Intrusive (invasive) versus unobtrusive measurement
     • How invasive? Can we quantify the interference of monitoring?
     • E.g. should the user mark relevance, or do we just follow clicks?
   • Simulation – how detailed? Validated against what?
   • Benchmarks
   • Repeatability
3. Experiment design
   – Lesion studies / ablation tests (with and without a component); see the sketch below
   – Iron-man (e.g. human performance), straw-man
   – Baselines, ceilings and floors
   – Factorial design
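A lesion/ablation test could be structured as in the sketch below: run the same inputs through the full system and through a version with one component disabled, then compare the scores. run_system and its scores are hypothetical stand-ins for a real system.

```python
# Sketch of a lesion/ablation test: same inputs, with and without one
# component. run_system is a hypothetical stand-in.

def run_system(query, use_link_component=True):
    """Toy stand-in for the real system; ignores the query and returns
    a fixed quality score per configuration."""
    return 0.63 if use_link_component else 0.55

queries = ["q1", "q2", "q3"]
full     = [run_system(q, use_link_component=True) for q in queries]
lesioned = [run_system(q, use_link_component=False) for q in queries]
print(f"full: {sum(full)/len(full):.2f}   "
      f"lesioned: {sum(lesioned)/len(lesioned):.2f}")
```

The gap between the two averages is the component's measured contribution, which is exactly what the lesion design isolates.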
Experimental Lifecycle
[Cycle diagram repeated: Vague idea (“groping around”, experiences) → Initial observations → Model/Theory → Hypothesis → Experiment → Data, analysis, interpretation → Results & final presentation]
Highlighted steps: 1. Run experiments. 2. Analyze and interpret data. 3. Data presentation.
A Systematic Approach
1. Run experiments
   • How many trials? How many combinations of parameter settings? (e.g. user age groups)
   • Practically limited
2. Analyze and interpret data
   • Descriptive statistics
     • Dealing with variability, outliers
   • Hypothesis testing: sample vs. population (see the sketch below)
     • Potentially infinite population (e.g. software runs)
     • Claims about variable values for the population, based on sample variables
   • Statistical significance
3. Data presentation
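For the hypothesis-testing step, one common (though by no means the only) choice is a two-sample t-test; the sketch below applies scipy's implementation to made-up run-time samples.

```python
# Sketch: are two samples of runs significantly different? A two-sample
# t-test is one common choice; the run times below are made up.
from scipy import stats

baseline = [102, 98, 110, 105, 99, 103, 108, 101]
treated  = [95, 92, 101, 97, 94, 96, 100, 93]

t, p = stats.ttest_ind(baseline, treated)
print(f"t = {t:.2f}, p = {p:.4f}")  # small p: difference unlikely to be chance
```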
A Systematic Approach
1. Run experiments
   • How many trials? How many combinations of parameter settings?
   • Sensitivity analysis on other parameter values (see the sketch below)
2. Analyze and interpret data
   • Statistics, dealing with variability, outliers
3. Data presentation
4. Where does it lead us next?
   • New hypotheses, new questions, a new round of experiments
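A sensitivity analysis can be as simple as sweeping one parameter while holding the rest fixed and watching how the metric responds, as sketched below; evaluate() is a hypothetical stand-in for re-running the experiment at each setting.

```python
# Sketch of a sensitivity analysis: sweep one parameter, hold the rest
# fixed, and watch the metric. evaluate() is a hypothetical stand-in.

def evaluate(cache_size_mb, num_users=100):
    """Made-up response curve; num_users stands for parameters held fixed."""
    return 1.0 - 1.0 / (1.0 + 0.05 * cache_size_mb)

for cache in [8, 16, 32, 64, 128]:
    print(f"cache={cache:>3} MB -> score={evaluate(cache):.3f}")
```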
Experimental Lifecycle
[Cycle diagram repeated, closing the lecture: Vague idea (“groping around”, experiences) → Initial observations → Model/Theory → Hypothesis → Experiment → Data, analysis, interpretation → Results & final presentation → and the cycle repeats]