GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign
IBM T.J. Watson Research Center
Presented by Chao Liu
1
Motivations
• Blossoming of open-source projects
  • SourceForge.net: 125,090 projects as of July 2006
• Convenience for software plagiarism?
  • You can always find something online
• Core-part plagiarism
  • Ripping off GUIs and irrelevant parts
  • (Illegally) reusing the implementations of core algorithms
• Our goal
  • Efficient detection of core-part plagiarism
2
Challenges
• Effectiveness
  • Professional plagiarists
  • Automated plagiarism
• Efficiency
  • Only a small part of the code is plagiarized; how can it be detected efficiently?
3
Outline
• Plagiarism Disguises
• Review of Plagiarism Detection
• GPLAG: PDG-based Plagiarism Detection
• Efficiency and Scalability
• Experiments
• Conclusions
4
Original Program
A procedure from a program called join
static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count;
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++){
    ...
  }
}
5
Disguise 1: Format Alteration
Insert comments and blanks
static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count; // initialization
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++){
    ...
  }
}
6
Disguise 2: Identifier Renaming
Rename variables consistently
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->nfields = num; // initialization
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  for (i = 0; i < num; i++){
    ...
  }
}
7
Disguise 3: Statement Reordering
Reorder non-dependent statements
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->nfields = num; // initialization
  for (i = 0; i < num; i++){
    ...
  }
}
8
Disguise 4: Control Replacement
Use equivalent control structure
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->nfields = num; // initialization
  i = 0;
  while (i < num){
    ...
    i++;
  }
}
9
Disguise 5: Code Insertion
Insert immaterial code
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->nfields = num; // initialization
  i = 0;
  while (i < num){
    ...
    for (int j = 0; j < i; j++);
    i++;
  }
}
10
Fully Disguised
Original Code
static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count;
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++){
    ...
  }
}
Plagiarized Code
static void
fill_content(int num, struct line* fill)
{
  (*fill).store.size = fill->store.length = num + 1;
  struct field *tabs;
  (*fill).fields = tabs = (struct field *)
    xmalloc (sizeof (struct field) * num);
  (*fill).store.buffer = (char*) xmalloc (fill->store.size);
  (*fill).ntabs = num;
  unsigned char *pb;
  pb = (unsigned char *) (*fill).store.buffer;
  int idx = 0;
  while(idx < num){ // fill in the storage
    ...
    for(int j = 0; j < idx; j++)
    ...
    idx++;
  }
}
11
Review of Plagiarism Detection
• String-based [Baker 1995]
  • A program is represented as a string
  • Blanks and comments are ignored
• AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
  • A program is represented as an Abstract Syntax Tree (AST)
  • Fragile to statement reordering, control replacement and code insertion
• Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
  • Variables of the same type are mapped to the same token
  • A program is represented as a token string
  • Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
  • Partially robust to statement reordering, control replacement and code insertion
  • Representatives: Moss and JPlag
13
Graphic representation of source code
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for(i = 0; i < count; i++){
    sum = add(sum, array[i]);
  }
  return sum;
}
int add(int a, int b)
{
  return a + b;
}
15
Control Dependency
[Figure: control dependency edges drawn over the sum and add procedures]
17
Data Dependency
[Figure: data dependency edges drawn over the sum and add procedures]
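The two kinds of dependency can be written down as a typed edge list. The sketch below is a hand-derived illustration for the body of sum(), not CodeSurfer output: the vertex numbering, the names pdg_edge, sum_pdg and count_edges, and the choice to list only a subset of the data dependencies are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* The two edge kinds of a program dependence graph (PDG). */
enum dep_kind { DEP_CONTROL, DEP_DATA };

struct pdg_edge {
    int from;            /* vertex that is depended upon */
    int to;              /* dependent vertex */
    enum dep_kind kind;
};

/* Hand-labelled vertices for the body of sum() (illustrative numbering):
 *   0: sum = 0      1: i = 0      2: i < count
 *   3: sum = add(sum, array[i])   4: i++      5: return sum
 * Only a subset of the data dependencies is listed. */
static const struct pdg_edge sum_pdg[] = {
    {2, 3, DEP_CONTROL},  /* loop body executes only if i < count */
    {2, 4, DEP_CONTROL},  /* so does the increment */
    {0, 3, DEP_DATA},     /* sum defined at 0, used at 3 */
    {1, 2, DEP_DATA},     /* i defined at 1, tested at 2 */
    {1, 3, DEP_DATA},     /* i defined at 1, used as index at 3 */
    {4, 2, DEP_DATA},     /* i redefined at 4, tested again at 2 */
    {3, 5, DEP_DATA},     /* sum defined at 3, returned at 5 */
};

static const size_t sum_pdg_len = sizeof sum_pdg / sizeof sum_pdg[0];

/* Count the edges of one kind in an edge list. */
size_t count_edges(const struct pdg_edge *edges, size_t n, enum dep_kind kind)
{
    assert(edges != NULL || n == 0);
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (edges[i].kind == kind)
            c++;
    return c;
}
```

Reordering statements 0 and 1 or renaming i leaves this edge list unchanged, which is why the later slides compare PDGs rather than text.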
18
Plagiarism Detectable?
[Figure: the original make_blank and the fully disguised fill_content side by side, the same code pair as on the "Fully Disguised" slide]
19
Corresponding PDGs
[Figure: the two PDGs; only the vertex labels survive here, the dependency edges do not]
PDG for the Original Code
0: assign, blank->fields = fields = ...
1: assign, fields = (struct ...
2: call-site, xmalloc()
3: decl., line* blank
4: assign, blank->buf.buffer = (char*) xm..
5: decl., struct field* fields
6: call-site, xmalloc()
7: assign, blank->nfields = count
8: decl., int count
9: assign, blank->buf.size = blank->...
10: assign, buffer = (unsigned) ...
11: decl., char* buffer
12: decl., int i
13: assign, i=0
14: inc., i++
15: control, i < count
PDG for the Plagiarized Code
0: assign, (*fill).fields = tabs = ...
1: assign, tabs = (struct ...
2: call-site, xmalloc()
3: decl., line* fill
4: assign, (*fill).store.buf = (char*) ...
5: decl., struct field* tabs
6: call-site, xmalloc()
7: assign, (*fill).ntabs = num
8: decl., int num
9: assign, (*fill).store.size = ...
10: assign, pb = (unsigned char*) (*fill)...
11: decl., char* pb
12: decl., int idx
13: assign, idx = 0
14: inc., idx++
15: control, while(idx < num)
16: decl., int j
17: assign, j=0
18: inc., j++
19: control, j < idx
20
PDG-based Plagiarism Detection
• A program is represented as a set of PDGs
  • Let g be a PDG of Procedure P in the original program
  • Let g’ be a PDG of Procedure P’ in the plagiarism suspect
• Subgraph isomorphism implies plagiarism
  • If g is subgraph isomorphic to g’, P’ is likely plagiarized from P
• γ-isomorphism: Graph g is γ-isomorphic to g’ if there exists a subgraph s of g such that s is subgraph isomorphic to g’, and |s| ≥ γ|g|.
• If g is γ-isomorphic to g’, the PDG pair (g, g’) is regarded as a plagiarized PDG pair, and is then returned to a human for examination.
21
Advantages
• Robust because it is hard to overhaul PDGs
  • Dependencies encode program logic
  • Program logic is the incentive for plagiarism
22
Efficiency and Scalability
• Search space
  • If the original program has n procedures and the plagiarism suspect has m procedures, n*m subgraph isomorphism tests are needed
• Pruning search space
  • Lossless filter
  • Statistical lossy filter
24
Lossless filter
• Interestingness
  • PDGs smaller than an interesting size K are excluded from both sides
• γ-isomorphism definition
  • A PDG pair (g, g’) is discarded if |g’| < γ|g|.
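The two lossless rules amount to a cheap size check run before any isomorphism test. The following is a minimal sketch, assuming |g| denotes the vertex count of a PDG; the function name passes_lossless_filter and its parameter names are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the pair (g, g') survives the lossless filter.
 * size_g and size_g2 are |g| and |g'|, assumed here to be vertex counts.
 * k is the interesting size K; gamma is the gamma-isomorphism threshold. */
bool passes_lossless_filter(size_t size_g, size_t size_g2,
                            size_t k, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);

    /* Interestingness: PDGs smaller than K are excluded on both sides. */
    if (size_g < k || size_g2 < k)
        return false;

    /* A subgraph s of g with |s| >= gamma*|g| must embed in g',
     * so g' cannot match when |g'| < gamma*|g|. */
    if ((double)size_g2 < gamma * (double)size_g)
        return false;

    return true;
}
```

Both checks are lossless in the sense the slide uses: a discarded pair could never have satisfied the γ-isomorphism definition or the interestingness requirement, so no true match is lost.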
25
Lossy Filter
• Observation
  • If procedure P’ is plagiarized from procedure P, its PDG g’ should look similar to g.
  • So discard dissimilar PDG pairs
• Requirement
  • This filter must be lightweight
26
Vertex Histogram
• Represent PDG g by h(g) = (n_1, n_2, …, n_k), where n_i is the frequency of the i-th kind of vertex.
• Similarly, represent PDG g’ by h(g’) = (m_1, m_2, …, m_k).
• Direct similarity measurement?
  • How to define a proper similarity threshold?
  • Is a threshold defined this way program-independent?
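A vertex histogram is a single counting pass over the vertices of a PDG. In the sketch below the vertex kinds are the ones visible in the PDG slide labels (decl., assign, call-site, control, inc.); a real PDG generator may distinguish more kinds, and the names vertex_kind and vertex_histogram are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Vertex kinds taken from the labels on the PDG slide; the full set
 * used by an actual PDG tool may differ. */
enum vertex_kind {
    V_DECL, V_ASSIGN, V_CALL_SITE, V_CONTROL, V_INC,
    V_KIND_COUNT            /* number of kinds, i.e. k */
};

/* Fill hist[i] with the frequency of the i-th vertex kind. */
void vertex_histogram(const enum vertex_kind *vertices, size_t n,
                      unsigned hist[V_KIND_COUNT])
{
    assert(vertices != NULL || n == 0);
    for (size_t k = 0; k < V_KIND_COUNT; k++)
        hist[k] = 0;
    for (size_t i = 0; i < n; i++)
        hist[vertices[i]]++;
}
```

Applied to the 16 vertex labels listed for the original make_blank PDG, this gives 5 declarations, 7 assignments, 2 call sites, 1 control vertex and 1 increment.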
27
Hypothesis Testing-based Approach
• Basic idea
  • Estimate a k-dimensional multinomial distribution from h(g)
  • Test whether h(g’) is likely an observation from that distribution
  • If it is, g’ looks similar to g, and an isomorphism test is needed.
  • Otherwise, (g, g’) is discarded
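The idea can be sketched with any multinomial goodness-of-fit statistic. The slides' own technical details are not reproduced in this text, so the code below substitutes the standard Pearson chi-square statistic as a stand-in; it should not be read as GPLAG's actual test. The function name chi_square_stat and the add-one smoothing are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Pearson chi-square statistic for observing histogram m (from g')
 * under a multinomial whose cell probabilities are estimated from
 * histogram n (from g). Add-one smoothing (an assumption of this
 * sketch) keeps every estimated probability positive.
 * Smaller values mean h(g') is more plausible under g's distribution. */
double chi_square_stat(const unsigned *n, const unsigned *m, size_t k)
{
    assert(k > 0 && n != NULL && m != NULL);

    double n_total = 0.0, m_total = 0.0;
    for (size_t i = 0; i < k; i++) {
        n_total += n[i];
        m_total += m[i];
    }

    double stat = 0.0;
    for (size_t i = 0; i < k; i++) {
        double p = (n[i] + 1.0) / (n_total + (double)k); /* smoothed estimate */
        double expected = m_total * p;   /* expected count of kind i in g' */
        double diff = (double)m[i] - expected;
        if (expected > 0.0)
            stat += diff * diff / expected;
    }
    return stat;
}
```

A caller would compare the statistic against a critical value chosen for the desired significance level: pairs above it are discarded, and pairs below it go on to the isomorphism test.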
28
Technical Details
29
Technical Details (cont’d)
30
Work-flow of GPLAG
• PDGs are generated with CodeSurfer
• Isomorphism testing is implemented with VFLib.
31
Experiment Design
• Subject programs
• Effectiveness
• Filter efficiency
• Core-part plagiarism detection
33
Effectiveness
• The plagiarism took 2 hours to produce manually, but can it be automated?
• GPLAG detects all plagiarized PDG pairs within 1 second
• PDG isomorphism also reveals which plagiarism disguises were applied
34
Efficiency
• Subject programs
  • bc, less and tar.
  • An exact copy serves as the plagiarized version.
• Lossless and lossy filter
  • Pruning PDG pairs.
  • Implications for overall time cost.
35
Pruning Uninteresting PDG-pairs
• Lossless only
• Lossless and lossy
36
Implications for Overall Time Cost
• A time-out is imposed on subgraph isomorphism testing; tests that hit it are the time hogs.
• The lossless filter does not save much time.
• The lossy filter significantly reduces the time cost.
• The major time saving comes from avoiding time hogs.
37
Detection of Core-part Plagiarism
• Lower time cost with the lossy filter.
• Fewer false positives with the lossy filter.
38
Conclusions
• We developed a new algorithm, GPLAG, for software plagiarism detection
  • It is more effective against “professional” plagiarists
• We developed a statistical lossy filter, which improves the efficiency of GPLAG
• We experimentally verified the effectiveness and efficiency of GPLAG
40
Q & A
Thank You!
41
References
[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.
[7] V. B. Livshits and T. Zimmermann. DynaMine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for “backtrace” of noncrashing bugs. In SDM, 2005.
42