LicenseName - Software Engineering Laboratory

A Preliminary Study on
Impact of Software Licenses on
Copy-and-Paste Reuse
Yu Kashima†, Yasuhiro Hayase††, Norihiro Yoshida†††, Yuki Manabe†, Katsuro Inoue†
†: Osaka University  ††: Toyo University  †††: Nara Institute of Science and Technology
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
Software Reuse
• Purpose of software reuse
– Development of reliable software
– Increasing software productivity
• We focus on Copy-and-Paste(CnP)
– A basic method of software reuse
Open Source Software and Licenses
• Open Source Software(OSS)
– Derivative works from OSS products are allowed
to be distributed
– Reusable source code is increasing because of
increasing OSS products
• OSS Licenses
– Many kinds of licenses are designed to satisfy the various intents of developers
– Each OSS license has different conditions
– Reuse is also restricted by the licenses
Representative OSS Licenses
• 3-clause BSD License(BSD3)
– A derivative work must retain copyright notices, list of
conditions and disclaimer of warranties
• Apache License Version 2(Apachev2)
– A derivative work must retain copyrights, patents,
trademarks and attribution notices
• GNU General Public License Version 2(GPLv2)
– A derivative work must be distributed under GPLv2
• LicenseName Code ≡ source code distributed
under LicenseName
Ex. BSD3 code ≡ source code distributed under BSD3
CnP between files under different licenses
• If a developer reuses source code:
– Both the license of the reused code and the license of the code under development must be satisfied simultaneously
– Distribution of the code under development is prohibited in some cases
[Figure: CnP between BSD3 code and GPLv2 code, and between Apachev2 code and GPLv2 code]
Impact of License on CnP
• Hypothesis
– Characteristics of source code reuse depend on the license of the code
• Frequency of CnP
• Kinds of licenses used by source code developed by CnP
• To our knowledge, there are no quantitative studies on CnP reuse from the perspective of software licenses
• We investigate actual OSS to confirm this hypothesis
Experiment
• A quantitative experiment was performed on a small data set
• Purpose
– Confirming our hypothesis
– Investigating the scalability of our method
• Overview
– Investigation of the number of CnP for each license
– Code clone detection is used for CnP detection
• A code clone is a code fragment similar to another code fragment
• A code clone is typically generated by CnP
Method of Experiment
[Figure: overview of the method. Source files of Application X and Application Y (License A, License B, Unknown) go through Step1. License detection, Step2. Code clone detection, and Step3. Counting code clones, which yields code fragments grouped by their license.]
License    #Code Fragments
License A  10
License B  3
…          …
Step1. License Detection
• Ninka[1] is used for detecting the licenses of source files
– It analyzes the license description in each source file
– It detects licenses with high precision
• Files whose licenses Ninka fails to detect are excluded
– Files that contain no license description or an unknown license description
[1] D. M. German, Y. Manabe and K. Inoue: “A sentence-matching method for automatic license identification of source
code files”, ASE 2010, pp. 437–446 (2010)
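Step 1 can be sketched as follows. This is a minimal illustration, not Ninka's interface: the input dictionary, the function name, and the labels "NONE" and "UNKNOWN" for unidentified files are assumptions made for this example.

```python
from collections import defaultdict

# Assumed labels for files whose license could not be identified;
# the real labels depend on the license detector's output format.
UNIDENTIFIED = {"NONE", "UNKNOWN"}


def group_files_by_license(detected):
    """Group file paths by detected license, skipping unidentified files.

    `detected` maps a file path to the license label reported for it.
    """
    groups = defaultdict(list)
    for path, license_name in detected.items():
        if license_name in UNIDENTIFIED:
            continue  # excluded from the analysis, as in Step 1
        groups[license_name].append(path)
    return dict(groups)


# Made-up detection results, for illustration only.
detected = {
    "x/Foo.java": "BSD3",
    "x/Bar.java": "Apachev2",
    "y/Baz.java": "BSD3",
    "y/Qux.java": "UNKNOWN",
    "z/Quux.java": "NONE",
}
groups = group_files_by_license(detected)
print(groups)
```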
Step2. Code Clone Detection
• CCFinder[2] is used for extracting code clones across different applications
– We assume that CnP within an application will not cause license problems
• Filtering
– Code clones generated by something other than CnP are excluded
Ex. getters/setters, variable declarations
• The direction of each CnP is not determined
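[Figure: code clones between Application X (License A), Application Y (License B), and Application Z (License C). Clones across applications are treated as CnP; clones consisting of variable declarations or getters/setters are filtered out.]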
[2] T. Kamiya, S. Kusumoto and K. Inoue: “CCFinder: A multilinguistic token-based code clone detection system
for large scale source code”, IEEE Transactions on Software Engineering, 28, pp. 654–670 (2002)
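The two filters of Step 2 can be sketched as follows. This is not CCFinder's output format: the `Fragment` type, its `kind` tag, and the tag names are hypothetical, and the slides do not state how getters/setters and variable declarations are actually recognized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    application: str
    license: str
    kind: str  # hypothetical tag, e.g. "method_body" or "getter_setter"


# Hypothetical tags for clones that are unlikely to come from CnP.
NON_CNP_KINDS = {"getter_setter", "variable_declarations"}


def keep_cnp_candidates(clone_pairs):
    """Keep clone pairs that cross applications and look like CnP."""
    kept = []
    for a, b in clone_pairs:
        if a.application == b.application:
            continue  # CnP within one application is assumed harmless
        if a.kind in NON_CNP_KINDS or b.kind in NON_CNP_KINDS:
            continue  # clone not generated by CnP
        kept.append((a, b))
    return kept


# Made-up clone pairs, for illustration only.
x1 = Fragment("X", "License A", "method_body")
x2 = Fragment("X", "License A", "method_body")
y1 = Fragment("Y", "License B", "method_body")
z1 = Fragment("Z", "License C", "getter_setter")

pairs = [(x1, x2), (x1, y1), (y1, z1)]
candidates = keep_cnp_candidates(pairs)
print(candidates)
```

The pair order carries no meaning here, since the direction of CnP is not determined.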
Step3. Counting Code Clones (1/2)
• Repeat the following steps for each target license
1. Select a license as an analysis target
2. Extract clone sets including code under the license
• A clone set is a set of code clones similar to each other
3. Count the code fragments in the extracted clone sets, grouped by their license
[Figure: clone sets spanning Application X (License A), Application Y (License B), and Application Z (License C)]
Fragments having CnP relations to License A code
License    #Code Fragments
License A  2
License B  1
License C  2
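The counting steps above can be sketched as follows. The representation of a clone set as a list of (application, license) pairs is an assumption, and the clone sets are made up so that the counts match the example table for License A.

```python
from collections import Counter


def count_fragments_for(target, clone_sets):
    """Count fragments per license over clone sets that include `target`.

    Each clone set is a list of (application, license) pairs,
    one pair per code fragment.
    """
    counts = Counter()
    for clone_set in clone_sets:
        licenses = [license_name for _app, license_name in clone_set]
        if target in licenses:  # step 2: keep sets containing the target
            counts.update(licenses)  # step 3: count, grouped by license
    return counts


# Made-up clone sets, for illustration only.
clone_sets = [
    [("X", "License A"), ("Y", "License B")],
    [("X", "License A"), ("Z", "License C"), ("Z", "License C")],
    [("Y", "License B"), ("Z", "License C")],  # no License A: skipped
]
counts = count_fragments_for("License A", clone_sets)
for license_name in sorted(counts):
    print(license_name, counts[license_name])
```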
Step3. Counting Code Clones (2/2)
• A clone set includes both the original code fragments and the code fragments generated by CnP
→ Counting the code fragments in clone sets approximates counting the number of CnP
• We count the number of CnP to/from code fragments under the target license
• Although this table includes CnP in the opposite direction, it is sufficient for grasping the overall tendency
[Figure: clone sets spanning Application X (License A), Application Y (License B), and Application Z (License C)]
Fragments having CnP relations to License A code
License    #Code Fragments
License A  2
License B  1
License C  2
Analyzed Code
• Java files (.java) in the Debian GNU/Linux 5.0.2 main section
• Reasons for selecting this target
– It consists of code under various licenses
– It can be analyzed by both Ninka and CCFinder
– It is a feasible scale for this experiment
#Packages  #Files  LOC
452        77,452  8,530,896
License Distribution in Analyzed Code
[Figure: bar chart of #Files for each license; #Files axis from 0 to 20,000]
Result(BSD3)
• Result of counting the code fragments in clone sets that include BSD3 fragments, grouped by their license
• The frequency of each license used by code fragments having a CnP relationship to BSD3 fragments
License                   #Fragments  Percentage
BSD3                      613         92%
GPLv2+                    20          3.0%
Apachev2                  16          2.4%
LesserGPL2+               14          2.1%
GPLv2,ClassPathException  1           0.15%
LesserGPL2.1+             1           0.15%
• BSD3 code is mostly reused by BSD3 code
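The percentages in the BSD3 table follow from the fragment counts. A minimal sketch, assuming each percentage is the count divided by the total number of fragments in the table (the function name is chosen for this example):

```python
# Fragment counts taken from the BSD3 result table.
bsd3_counts = {
    "BSD3": 613,
    "GPLv2+": 20,
    "Apachev2": 16,
    "LesserGPL2+": 14,
    "GPLv2,ClassPathException": 1,
    "LesserGPL2.1+": 1,
}


def percentages(counts):
    """Return each license's share of all counted fragments, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = percentages(bsd3_counts)
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```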
Result(Apachev2)
License         #Fragments  Percentage
Apachev2        1533        77%
Apachev1.1      316         16%
LesserGPL2.1+   42          2.1%
MPLv1.1         33          1.6%
BSD3            29          1.5%
MX4JLicensev1   16          0.80%
GPLv2+          4           0.20%
LibraryGPL2+    3           0.15%
MPLv1.0         2           0.10%
MITX11noNotice  2           0.10%
Public Domain   1           0.050%
Subversion+     1           0.050%
EPLv1           1           0.050%
• A large percentage of CnP is between Apachev2 code fragments
• The license of Apachev1.1 code has since been changed to Apachev2
Result(GPLv2+)
License                            #Fragments  Percentage
GPLv2+                             268         44%
GPLnoVersion,GPLv2+,LinkException  225         41%
BSD3                               28          5.1%
LibraryGPLv2+                      20          3.6%
Apachev2                           4           0.73%
LesserGPLv2.1+                     4           0.73%
• CnP within GPLv2+ code accounts for the highest percentage
• “GPLnoVersion, GPLv2+, LinkException” also has a high percentage
• “GPLnoVersion, GPLv2+, LinkException” code is reused by GPLv2+ code
[Figure: CnP between “GPLnoVersion, GPLv2+, LinkException” code and GPLv2+ code]
#Files and #Fragments under Each License
• Code under a license is copy-and-pasted frequently if “#Fragments / #Files” of the license is large
License   #Fragments  #Files  #Fragments / #Files
BSD3      665         2181    0.305
Apachev2  1983        16350   0.121
GPLv2+    549         8160    0.0673
• The frequency of CnP per file
BSD3 > Apachev2 > GPLv2+
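The ranking can be reproduced from the table values. A minimal sketch (the variable and function names are chosen for this example):

```python
# (#Fragments, #Files) for each license, taken from the table.
table = {
    "BSD3": (665, 2181),
    "Apachev2": (1983, 16350),
    "GPLv2+": (549, 8160),
}


def fragments_per_file(rows):
    """Return #Fragments / #Files for each license."""
    return {name: fragments / files for name, (fragments, files) in rows.items()}


ratios = fragments_per_file(table)
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(" > ".join(ranking))  # prints "BSD3 > Apachev2 > GPLv2+"
```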
Summary of the Results
• Common characteristic of all licenses
– CnP within code distributed under the same license, or under licenses designed by the same organization, accounts for the majority
• CnP might happen mostly within an organization
• Apachev2 has CnP relations to various licenses
– Apachev2 has the largest number of files
– The conditions of Apachev2 are more relaxed than those of GPLv2+
• The frequency of CnP per file
BSD3 > Apachev2 > GPLv2+
Threats to Validity
• The results are insufficient to generalize to OSS as a whole
– The analysis target is small
→ We plan a large-scale analysis
– Only Java files were analyzed
• The history of Java is short, hence Java files may be less copy-and-pasted than files in other languages
→ We plan an analysis of C/C++ files
• Overlapping code fragments may be counted separately
– The number of overlapping code fragments might be small
[Figure: Fragment A and Fragment B overlapping]
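The slides do not say how overlapping fragments would be handled. One possible way to avoid counting them separately is to merge fragments of the same file whose line ranges overlap. The sketch below assumes a fragment is a (file, start line, end line) tuple; that representation and the function name are hypothetical.

```python
def merge_overlapping(fragments):
    """Merge fragments of the same file whose line ranges overlap.

    Each fragment is a (file, start_line, end_line) tuple with
    inclusive line numbers.
    """
    merged = []
    for file, start, end in sorted(fragments):
        if merged and merged[-1][0] == file and start <= merged[-1][2]:
            # Overlaps the previous fragment: extend it instead of adding.
            prev_file, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_file, prev_start, max(prev_end, end))
        else:
            merged.append((file, start, end))
    return merged


# Made-up fragments, for illustration only.
fragments = [
    ("A.java", 10, 30),  # Fragment A
    ("A.java", 25, 40),  # Fragment B, overlapping Fragment A
    ("A.java", 50, 60),
    ("B.java", 10, 20),
]
merged = merge_overlapping(fragments)
print(len(fragments), "fragments before merging,", len(merged), "after")
```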
Scalability of the Investigation Method
• This method can be applied to a large target, because each step scales
– License detection
• Ninka can analyze files in linear time
– Code clone detection
• There are more scalable tools than CCFinder, such as CCFinderX and D-CCFinder
– Counting code clones
• This process did not take a long time
Conclusion
• A preliminary study of the impact of licenses on CnP was performed
– Java files in the Debian GNU/Linux 5.0.2 main section were analyzed
• CnP happens mostly within code distributed under the same license or under licenses designed by the same organization
• The frequency of CnP per file
– BSD3 > Apachev2 > GPLv2+
• Our method can be applied to a large target
Future Work
• Large Scale Experiment
• Investigating whether code fragments are copy-and-pasted mostly within an organization
• Detecting direction of CnP