A Human Study of Patch Maintainability

Zachary P. Fry, Bryan Landau, Westley Weimer
University of Virginia
{zpf5a,bal2ag,weimer}@virginia.edu
Bug Fixing
2
Fixing bugs manually is difficult and costly.
- Recent techniques explore automated patches:
  - Evolutionary techniques – GenProg
  - Dynamic modification – ClearView
  - Enforcement of pre/post-conditions – AutoFix-E
  - Program transformation via static analysis – AFix

While these techniques save developers time, there is some concern as to whether the patches produced are human-understandable and maintainable in the long run.
Questions Moving Forward
3
- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?
Measuring quality and maintainability
5
- Functional Quality – Does the implementation match the specification?
  - Does the code execute “correctly”?
- Non-functional Quality – Is the code understandable to humans?
  - How difficult is it to understand and alter the code in the future?
Software Functional Quality
6
- Perfect: implementation matches specification
- Direct software quality metrics:
  - Testing
  - Defect density
  - Mean time to failure
- Indirect software quality metrics:
  - Cyclomatic complexity
  - Coupling and cohesion (CK metrics)
  - Software readability
Software Non-functional Quality
7
- Maintainability:
  - Human-centric factors affecting the ease with which bugs can be fixed and features can be added
  - Broadly related to the “understandability” of code
  - Unlike functional correctness, not easy to measure concretely with heuristics
- These automatically-generated patches have been shown to be of high quality functionally – what about non-functionally?
Patch Maintainability Defined
8
Rather than using an approximation to measure understandability, we will directly measure humans’ abilities to perform maintenance tasks.
- Task: ask human participants questions that require them to read and understand a piece of code, and measure the effort required to provide correct answers
- Simulate the maintenance process as closely as possible
PHP Bug #54454
9
Title: “substr_compare incorrectly reports equality in some cases”
- Bug description: “if main_str is shorter than str, substr_compare [mistakenly] checks only up to the length of main_str”
- substr_compare(“cat”, “catapult”) = true
Motivating Example
10
if (offset >= s1_len) {
    php_error_docref(NULL TSRMLS_CC, E_WARNING,
        "The start position cannot exceed string length");
    RETURN_FALSE;
}
if (len > s1_len - offset) {
    len = s1_len - offset;
}
cmp_len = (uint) (len ? len : MAX(s2_len, (s1_len - offset)));
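To make the reported failure concrete, the following is a minimal, self-contained C sketch of the flawed behavior; it is an illustration only (the function name and the use of strncmp are invented for exposition and are not the PHP interpreter's actual implementation). Comparing only up to the length of the shorter string makes “cat” appear equal to “catapult”.

#include <stdio.h>
#include <string.h>

/* Illustrative only: mimics the bug by comparing just the shorter length. */
static int buggy_substr_compare(const char *main_str, const char *str) {
    size_t main_len = strlen(main_str);
    size_t str_len  = strlen(str);
    size_t cmp_len  = main_len < str_len ? main_len : str_len;  /* too short */
    return strncmp(main_str, str, cmp_len);                     /* 0 means "equal" */
}

int main(void) {
    /* Prints "equal", mirroring substr_compare("cat", "catapult") = true. */
    printf("%s\n", buggy_substr_compare("cat", "catapult") == 0 ? "equal" : "different");
    return 0;
}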
Motivating Example
11
len--;
if (mode & 2) {
    for (i = len - 1; i >= 0; i--) {
        if (mask[(unsigned char)c[i]]) {
            len--;
        } else {
            break;
        }
    }
}
if (return_value) {
    RETVAL_STRINGL(c, len, 1);
} else {
Automatic Documentation
12
Intuitions suggest that patches augmented with documentation are more maintainable.
- Human patches can contain comments with hints as to the developer’s intention when changing code
- Automatic approaches cannot easily reason about why a change is made, but can describe what was changed
- Automatically Synthesized Documentation: DeltaDoc (Buse et al., ASE 2010)
  - Measures semantic program changes
  - Outputs natural language descriptions of changes
Automatic Documentation
13
if (!con->conditional_is_valid[dc->comp]) {
    if (con->conf.log_condition_handling) {
        TRACE("cond[%d] is valid: %d", dc->comp,
            con->conditional_is_valid[dc->comp]);
    }
    /* If not con->conditional_is_valid[dc->comp]
       No longer return COND_RESULT_UNSET; */
    return COND_RESULT_UNSET;
}
/* pass the rules */
switch (dc->comp) {
case COMP_HTTP_HOST: {
    char *ck_colon = NULL, *val_colon = NULL;
Questions Moving Forward
14
- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?
Evaluation
15
Focused research questions to answer:
1) How do different types of patches affect maintainability?
2) Which source code characteristics are predictive of our maintainability measurements?
3) Do participants’ intuitions about maintainability and its causes agree with measured maintainability?

To answer these questions directly, we performed a human study using over 150 participants with real patches from existing systems.
Experiment - Subject Patches
16
We used patches from six benchmarks over a variety of subject domains:

Program      LOC         Defects   Patches
gzip         491,083      1         2
libtiff      77,258       7        14
lighttpd     61,528       3         4
php          1,046,421    9        17
python       407,917      1         2
wireshark    2,812,340   11        11
Total:       4,896,547   32        50
Experiment - Subject Patches
17
- Original – the defective, un-patched code used as a baseline for measuring relative changes
- Human-Accepted – human-created patches that have not been reverted to date
- Human-Reverted – human-created patches that were later reverted
- Machine – automatically-generated patches created by the GenProg tool
- Machine+Doc – the same patches as above, but augmented with automatically synthesized documentation
Experiment – Maintenance Task
18
Sillito et al. – “Questions programmers ask during software evolution tasks”
- Recorded and categorized the questions developers actually asked while performing real maintenance tasks
- “What is the value of the variable “y” on line X?”
- Not: “Does this type have any siblings in the type hierarchy?”
Human Study
19
Participants saw code excerpts with their original line numbers (lines 15–34 of the file are reproduced below; the numbering is omitted here):

if (dc->prev) {
    if (con->conf.log_condition_handling) {
        log_error_write(srv, __FILE__, __LINE__, "sb", "go prev", dc->prev->key);
    }
    /* make sure prev is checked first */
    config_check_cond_cached(srv, con, dc->prev);
    /* one of prev set me to FALSE */
    if (COND_RESULT_FALSE == con->cond_cache[dc->context_ndx].result) {
        return COND_RESULT_FALSE;
    }
}
if (!con->conditional_is_valid[dc->comp]) {
    if (con->conf.log_condition_handling) {
        TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
    }
    return COND_RESULT_UNSET;
}
Human Study
20
- Question presentation:

Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (Recall, you can use inequality symbols in your answer.)
Answer to the Question Above:
Human Study
21
(The same code excerpt as on the previous code slide, lines 15–34, was shown again.)
Human Study
22
- Question presentation, with a correct answer:

Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (Recall, you can use inequality symbols in your answer.)
Answer to the Question Above: False
(The answer follows from the enclosing if (!con->conditional_is_valid[dc->comp]) guard: inside that block the variable must be false.)
Evaluation Metrics
23
- Correctness – is the right answer reported?
- Time – what is the “maintenance effort” associated with understanding this code?
- We favor correctness over time:
  - Participants were instructed to spend as much time as they deemed necessary to correctly answer the questions
  - The percentages of correct answers over all types of patches were not different in a statistically significant way
  - We focus on time, as it is an analog for the software engineering effort associated with program understanding
Type of Patch vs. Maintainability
24

[Bar chart: “Percent Time Saved for Correct Answers When Compared with Original Code” (roughly -25% to 15%) for each patch type: Human Accepted, Machine, Human Reverted, Machine+Doc.]

Effort = average number of minutes it took participants to report a correct answer for all patches of a given type, relative to the original code.
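The slide does not give the formula behind “percent time saved”; one plausible reading (an assumption, not stated explicitly in the source) is, for each patch type t:

\[
\mathrm{TimeSaved}(t) \;=\; 100 \times \frac{\bar{T}_{\mathrm{original}} - \bar{T}_{t}}{\bar{T}_{\mathrm{original}}}
\]

where \(\bar{T}_{t}\) is the mean time to a correct answer on patches of type t and \(\bar{T}_{\mathrm{original}}\) is the mean time on the corresponding original code.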
Characteristics of Maintainability
26
- We measured various code features for all patches used in the human study
- Using a logistic regression model, we can predict human accuracy when answering the questions in the study 73.16% of the time
- A Principal Component Analysis shows that 17 features account for 90% of the variance in the data
  - Modeling maintainability is a complex problem
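For reference, a logistic regression over a vector x of code-feature measurements has the standard form below; the form itself is generic, and the particular features and fitted weights are those from the study (not reproduced here):

\[
\Pr(\text{correct answer} \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}
\]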
Characteristics of Maintainability
27

Code Feature                                             Predictive Power
Ratio of variable uses per assignment                    0.178
Code readability                                         0.157
Ratio of variables declared out of scope vs. in scope    0.146
Number of total tokens                                   0.097
Number of non-whitespace characters                      0.090
Number of macro uses                                     0.080
Average token length                                     0.078
Average line length                                      0.072
Number of conditionals                                   0.070
Number of variable declarations or assignments           0.056
Maximum conditional clauses on any path                  0.055
Number of blank lines                                    0.054
Human Intuition vs. Measurement
28
After completing the study, participants were asked to report which code features they thought increased maintainability the most.

Human Reported Feature                      Votes   Predictive Power
Descriptive variable names                   35     *0.000
Clear whitespace and indentation             25     *0.003
Presence of comments                         25      0.022
Shorter function                              8     *0.000
Presence of nested conditionals               8      0.033
Presence of compiler directives / macros      7      0.080
Presence of global variables                  5      0.146
Use of goto statements                         5     *0.000
Lack of conditional complexity                 5      0.055
Uniform use and format of curly braces         5      0.014
Conclusions
29
From a human study involving over 150 participants and patches fixing high-priority defects from real systems, we conclude:
- Humans take less time, on average, to answer questions about machine-generated patches with automated documentation than about human-created patches, which supports the use of automatic patch generation techniques in practice
- There is a strong disparity between human intuitions about maintainability and our measurements, so we think further study is merited in this area
30

Questions?
Modified DeltaDoc
31
We modify DeltaDoc in the following ways:
- Include all changes, regardless of length of output
- Ignore all internal optimizations that lead to loss of information (e.g., ignore suspected unrelated statements)
- Include all relevant programmatic information (e.g., function arguments)
- Ignore all high-level output optimizations
  - Favor comprehensive explanations over brevity
- Insert output directly above patches as comments
Experiment - Participants
32
Over 150 participants:
- 27 fourth-year undergraduate CS students
- 14 CS graduate students
- 116 Mechanical Turk internet participants

Accuracy cutoff imposed:
- Ensuring people don’t try to “game the system” requires special consideration
- Any participant who failed to answer all questions, or whose score fell more than one standard deviation below the average undergraduate student’s score, was removed (see the sketch below)
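As a rough illustration of that cutoff rule, here is a small C sketch; the struct and function names are invented for exposition and are not from the study’s tooling.

#include <math.h>
#include <stddef.h>

/* Hypothetical illustration of the exclusion rule: keep a participant only if
 * they answered every question and scored no more than one standard deviation
 * below the mean undergraduate score. */
typedef struct {
    int answered_all;  /* 1 if the participant answered every question */
    double score;      /* fraction of questions answered correctly */
} Participant;

static double mean(const double *xs, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += xs[i];
    return s / (double)n;
}

static double stddev(const double *xs, size_t n) {
    double m = mean(xs, n), s = 0.0;
    for (size_t i = 0; i < n; i++) s += (xs[i] - m) * (xs[i] - m);
    return sqrt(s / (double)n);
}

static int passes_cutoff(const Participant *p,
                         const double *undergrad_scores, size_t n) {
    double threshold = mean(undergrad_scores, n) - stddev(undergrad_scores, n);
    return p->answered_all && p->score >= threshold;
}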
Experiment - Questions
33
- What conditions must hold to always reach line X during normal execution?
- What is the value of the variable “y” on line X?
- What conditions must be true for the function “z()” to be called on line X?
- At line X, which variables must be in scope?
- Given the following values for relevant variables, what lines are executed beginning at line X? Y=5 && Z=True.