

Some Lessons for Evaluators of DARPA Programs

Paul Cohen

Computer Science

School of Information: Science, Technology, and Arts

University of Arizona

© Paul Cohen. School of Information: Science, Technology, and Arts. University of Arizona

Shameless plug

Textbook: Empirical Methods for Artificial Intelligence, MIT Press, 1995

Other material:

Empirical Methods Tutorial at the Pacific Rim AI Conference, 2008

Assessing the Intelligence of Cognitive Decathletes. Paul Cohen. Presented at the NIST Workshop on Cognitive Decathlon, Washington DC, January 2006.

If Not the Turing Test, Then What? Paul Cohen. Invited Talk at the National Conference on Artificial Intelligence, July 2004.

Various papers on empirical methods.


Outline

• Some general lessons about how to conduct evaluations of DARPA programs

• Some specific methodological lessons that every DARPA program manager should know – illustrated with a case study of a large IPTO program evaluation

• A checklist for evaluation designs


General lessons from DARPA program evaluations

• All DARPA program evaluations serve three masters: the director, the program manager, and the researchers.

• A well-designed evaluation gives these stakeholders what they need, but compromise is necessary and the evaluator should broker it

• The evaluator is not there to trip up the performer, but to design a test that can be passed. Whether it is passed is up to the performer.

• Start early. Ideally, the program claims, protocols and metrics are ready before the BAA/solicitation is even released.

• Keep the claims simple, but make sure there are claims

• Write the protocol (no PowerPoint!), including claims, materials and subjects, method, planned analyses, and expected results

• Run pilot experiments. Really. It's too expensive not to. Really. I mean it.

• Provide adequate infrastructure for the experiments. Don’t be cheap.


General lessons from DARPA program evaluations

• You are spending tens of millions on the program, so require the evaluation to provide more than one bit (pass/fail) of information (Lesson 5, below: demos are good, explanations better; or as Tony Tether said, “passing the test is necessary but not sufficient for continued funding.”)

• Stay flexible: Multi-year programs that test the same thing each year quickly become ossified. Review and refine claims (metrics, protocol...) annually.

• Stay flexible II: Let some parameters of the evaluation (e.g., number of subjects or test items) be set pragmatically and don’t freak if they change.

• Stay flexible III: Avoid methodological purists. Any fool can tell you why something is “not allowed” or your “sample size is wrong,” etc. A good evaluator finds workarounds and quantifies confidence.


Some methodological lessons that every DARPA program manager should know

1. Evaluation begins with claims; metrics without claims are meaningless

2. The task of empirical science is to explain variability

3. Humans are great sources of variability

4. Of sample variance, effect size, and sample size, control the first before touching the last

5. Demonstrations are good, explanations are better

6. Most explanations involve additional factors; most interesting science is about interaction effects, not main effects

7. Exploratory Data Analysis: use your eyes to look for explanations in data

8. Not all studies are experiments, not all analyses are hypothesis testing

9. Significant and meaningful are not synonyms


Lesson 1: Evaluation begins with claims

• The most important, most immediate and most neglected part of evaluation plans.

• What you measure depends on what you want to know, on what you claim.

• Claims:

– X is bigger/faster/stronger than Y

– X varies linearly with Y in the range we care about

– X and Y agree on most test items

– It doesn't matter who uses the system (no effects of subjects)

– My algorithm scales better than yours (e.g., a relationship between size and runtime depends on the algorithm)

• Non-claim: I built it and it runs fine on some test data
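To make the scaling claim concrete, here is a minimal sketch (hypothetical sizes and runtimes, Python/NumPy assumed) of how "my algorithm scales better than yours" becomes something measurable: fit runtime against input size for each algorithm and compare the fitted exponents.

```python
# Minimal sketch: turning the scaling claim into a measurement.
# The data below are hypothetical; only the shape of the analysis matters.
import numpy as np

sizes = np.array([1e3, 1e4, 1e5, 1e6])                 # input sizes
runtime_mine   = np.array([0.02, 0.25, 3.1, 38.0])     # seconds (hypothetical)
runtime_theirs = np.array([0.01, 0.40, 15.0, 600.0])   # seconds (hypothetical)

# Fit runtime ~ c * size^k on a log-log scale; the exponent k carries the claim.
k_mine, _   = np.polyfit(np.log(sizes), np.log(runtime_mine), 1)
k_theirs, _ = np.polyfit(np.log(sizes), np.log(runtime_theirs), 1)

print(f"my exponent = {k_mine:.2f}, theirs = {k_theirs:.2f}")
# The claim "my algorithm scales better" predicts k_mine < k_theirs.
```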


The team claims that its system performance is due to learned knowledge

[Diagram: new learning methods (learning that chooses its own features, hybrid learning methods, learning by advice, perceptual learning, learning by example, learning over diverse features, learning relations) feed a knowledge base within a common experimental environment, a system that supports integrated learning.]


Learning to put email in the right folders

[Protocol diagram: each subject's mail and mail folders are split into training and testing sets; three learning methods (REL, KB, SVM) learn from the training portion, and their predicted folders are compared with the subject's own folders to get classification accuracy.]
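A hedged sketch of this protocol follows. The program's REL and KB learners are not public code, so they appear only as commented-out hypothetical placeholders; scikit-learn's SVM stands in for the SVM baseline, and the split-and-score logic mirrors the diagram above.

```python
# Hedged sketch of the evaluation protocol: train on part of each subject's
# foldered mail, predict folders for held-out messages, score accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def evaluate_subject(messages, folders, learners, train_size):
    """messages: list of email texts; folders: the subject's own labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        messages, folders, train_size=train_size, stratify=folders)
    scores = {}
    for name, learner in learners.items():
        learner.fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, learner.predict(X_te))
    return scores

learners = {
    # "REL": RelationalLearner(),       # program-specific, hypothetical
    # "KB":  KnowledgeBasedLearner(),   # program-specific, hypothetical
    "SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
}
# scores = evaluate_subject(subject_mail, subject_folders, learners, train_size=200)
```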


Lesson 2: The task of empirical science is to explain variability

Lesson 3: Humans are a great source of variability

[Plot: classification accuracy vs. number of training instances.]



Why do you need statistics?

When something obviously works, you don't need statistics

When something obviously fails, you don't need statistics

Statistics is about the ambiguous cases, where things don't obviously work or fail.

Ambiguity is generally caused by variance, and some variance is caused by lack of control

If you don't get control in your experiment design, you try to supply it post hoc with statistics


Accuracy vs. Training Set Size

[Plot: accuracy vs. training set size (100 to ≥550 training instances), averaged over subjects, with one curve per algorithm (REL, KB, SVM). No differences are significant.]

Accuracy vs. Training Set Size (100% Coverage, Grouped)

[Plot: accuracy vs. number of training instances, grouped into 100-200, 250-400, and 450-750, with one curve per algorithm (REL, KB, SVM). No differences are significant.]


Why are things not significantly different?

Lesson 6: Most explanations involve additional factors

Means are close together and variance is high

Means are far apart but variance is high

Why is variance high? Your experiment looks at X1, the algorithm, and Y, the score, but there is usually an X2 lurking which contributes to variance

Lesson 2: The task of empirical science is to explain variability. Find and control X2!

[Diagram: scores for X1 = REL and X1 = KB, with the lurking factor X2 left unidentified (X2 = ?).]


Lesson 7: Exploratory Data Analysis means using your eyes to look for explanations in data

[Plot: accuracy scores broken out by subject and by algorithm.]

Which contributes more to variance in accuracy scores:

Subject or Algorithm?
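One way to ask that question of the data, sketched below under the assumption of a long-format table with hypothetical columns 'subject', 'algorithm', and 'accuracy', is to compare how much the overall variance shrinks once you condition on each factor.

```python
# Minimal EDA sketch: which factor explains more variance in accuracy?
# Assumes a long-format table with (hypothetical) columns
# 'subject', 'algorithm', and 'accuracy'.
import pandas as pd

def variance_explained(df: pd.DataFrame, factor: str, score: str = "accuracy") -> float:
    grand_var = df[score].var(ddof=0)                      # overall variance
    within = df.groupby(factor)[score].var(ddof=0).mean()  # variance left once the factor is known
    return 1.0 - within / grand_var                        # fraction explained by the factor

# Usage, given such a scores table:
# variance_explained(scores, "subject")     # per the slides, subject variance dominates here
# variance_explained(scores, "algorithm")
```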


7) EDA: use your eyes to look for explanations in data


• Three categories of “errors” identified

– Mis-foldered (drag-and-drop error)

– Non-stationary (wouldn’t have put it there now)

– Ambiguous (could have been in other folders)

• Users found that 40% – 55% of their messages fell into one of these categories


Subject   Folders   Messages   Mis-foldered   Non-stationary   Ambiguous
1         15        268        1%             13%              42%
2         15        777        1%             24%              16%
3         38        646        0%             7%               33%

EDA tells us the problem: We're trying to find differences between algorithms when the gold standards are themselves errorful – but in different ways, increasing variance!


Lesson 4: Of sample variance, effect size, and sample size, control the first before touching the last

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 / N}}$$


Lesson 4: Of sample variance, effect size, and sample size, control the first before touching the last

Subtract Alg1 from Alg2 for each subject, i.e., look at difference scores, correcting for variability of subjects

"matched pair" test

[Plot: per-subject difference scores, ranging roughly from 0 to 0.3, across about ten subjects.]

Significant difference having controlled variance due to subjects

[Plot: accuracy vs. number of training instances (100-200, 250-400, 450-750); curves for REL, KB (n.s.), and SVM (n.s.).]


Lesson 5: Demonstrations are good; explanations better

[Plot: accuracy vs. number of training instances (100-200, 250-400, 450-750) for REL, KB, and SVM; some comparisons are marked n.s.]

Having demonstrated that one algorithm is better than another we still can't explain:

• Why is it better? Is it something to do with the task or a general result?

• Why is it not better at all levels of training? Is it an artefact of the analysis or a repeatable phenomenon?

• Why does the REL curve look straight, unlike conventional learning curves?

These and other questions tell us we have demonstrated but not explained an effect; we don't know much about it.



Lesson 8: Not all studies are experiments, not all analyses are hypothesis testing

[Plot: accuracy vs. number of training instances for REL, KB, and SVM.]

The purpose of the study might have been to model the rate of learning

Modeling also involves statistics, but a different kind: Degree of fit, percentage of variance accounted for, linear and nonlinear models…

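A minimal sketch of the modeling alternative, with hypothetical (training size, accuracy) points: fit a simple learning-curve model and report how much variance it accounts for, rather than testing a difference.

```python
# Sketch: modeling the rate of learning rather than testing a hypothesis.
# Hypothetical (training size, accuracy) data points.
import numpy as np

n   = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500])
acc = np.array([0.42, 0.48, 0.55, 0.58, 0.62, 0.64, 0.66, 0.67, 0.69])

# One simple model: accuracy grows roughly with log(training size).
b, a = np.polyfit(np.log(n), acc, 1)
pred = a + b * np.log(n)
r2 = 1 - np.sum((acc - pred) ** 2) / np.sum((acc - acc.mean()) ** 2)
print(f"accuracy = {a:.2f} + {b:.2f} * log(n),  R^2 = {r2:.2f}")
```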


Lesson 9: Significant and meaningful are not synonyms

Reduction in uncertainty due to knowing Algorithm:

$$\omega^2 = \frac{\sigma_y^2 - \sigma_{y\,|\,\mathrm{Algorithm}}^2}{\sigma_y^2}$$

Estimate of the reduction in variance:

$$\hat{\omega}^2 = \frac{t^2 - 1}{t^2 + N - 1}$$

ω²      Comparison
.192    KB vs SVM
.336    REL vs SVM
.347    REL vs KB

For "fully trained" algorithms (≥ 500 training instances)


Lesson 6: Most interesting science is about interaction effects, not main effects

System’s performance improves at a greater rate when learned knowledge is included than when only engineered knowledge is included.

Learned knowledge begets learned knowledge

The lines aren’t parallel: The effect of development effort (horizontal axis) is different for the learning system than for the nonlearning system.

Interaction effect!
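A hedged sketch of how the interaction claim could be tested, using synthetic data (all names and numbers below are hypothetical): fit a model with an effort-by-learned-knowledge interaction term and ask whether the slopes differ.

```python
# Sketch: an interaction effect is about non-parallel lines, not a main effect.
# Hypothetical data: performance vs. development effort, with and without
# learned knowledge.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

effort = np.tile(np.arange(1, 7), 2)                # development effort (e.g., person-months)
learned = np.repeat(["without", "with"], 6)
performance = np.where(learned == "with",
                       10 + 6 * effort,             # steeper slope with learned knowledge
                       10 + 2 * effort) + np.random.default_rng(0).normal(0, 2, 12)
df = pd.DataFrame({"effort": effort, "learned": learned, "performance": performance})

# The interaction term effort:C(learned) tests whether the slopes differ,
# i.e., whether learned knowledge changes the *rate* of improvement.
model = smf.ols("performance ~ effort * C(learned)", data=df).fit()
print(model.params)
```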

[Plot: performance vs. development effort; the curve for the system with learned knowledge rises more steeply than the curve for the system without learned knowledge.]

Review of lessons every DARPA program manager needs to know

1. Evaluation begins with claims; metrics without claims are meaningless

2. The task of empirical science is to explain variability

3. Humans are great sources of variability

4. Of sample variance, effect size, and sample size, control the first before touching the last

5. Demonstrations are good, explanations are better

6. Most explanations involve additional factors; most interesting science is about interaction effects, not main effects

7. Exploratory Data Analysis: use your eyes to look for explanations in data

8. Not all studies are experiments, not all analyses are hypothesis testing

9. Significant and meaningful are not synonyms


Checklist for evaluation design

• What are the claims? What are you testing, and why?

• What is the experiment protocol or procedure? What are the factors (independent variables) and what are the metrics (dependent variables)? What are the conditions, and which is the control condition?

• Sketch a sample data table. Does the protocol provide the data you need to test your claim? Does it provide data you don't need? Are the data the right kind (e.g., real-valued quantities, frequencies, counts, ranks, etc.) for the analysis you have in mind? (A minimal example appears after this list.)

• Sketch the data analysis and representative results. What will the data look like if they support / don't support your conjecture?
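For the email study above, "sketch a sample data table" might look like the minimal sketch below (hypothetical column names and values): one row per subject, algorithm, and training-set-size cell of the design, so the planned analysis can be checked against it before any data are collected.

```python
# Minimal example of a sample data table for the email-foldering evaluation
# (hypothetical values): one row per cell of the design.
import pandas as pd

sample = pd.DataFrame(
    [(1, "REL", 100, 0.41),
     (1, "KB",  100, 0.38),
     (1, "SVM", 100, 0.35),
     (2, "REL", 100, 0.62)],     # ...and so on for every cell
    columns=["subject", "algorithm", "train_size", "accuracy"])
print(sample)
```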


Checklist for evaluation design, cont.

• Consider possible results and their interpretation. For each way the analysis might turn out, construct an interpretation. A good experiment design provides useful data in "all directions" – pro or con your claims

• Ask yourself again, what was the question? It's easy to get carried away designing an experiment and lose the big picture

• Is everyone satisfied? Are all the stakeholders in the evaluation going to get what they need?

• Run a pilot experiment to calibrate parameters

