A Preliminary Study on Impact of Software Licenses on Copy-and-Paste Reuse
Yu Kashima†, Yasuhiro Hayase††, Norihiro Yoshida†††, Yuki Manabe†, Katsuro Inoue†
†: Osaka University  ††: Toyo University  †††: Nara Institute of Science and Technology
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Software Reuse
• Purpose of software reuse
  – Development of reliable software
  – Increasing software productivity
• We focus on Copy-and-Paste (CnP)
  – A basic method of software reuse

Open Source Software and Licenses
• Open Source Software (OSS)
  – Derivative works of OSS products are allowed to be distributed
  – The amount of reusable source code is increasing as the number of OSS products increases
• OSS licenses
  – Many kinds of licenses have been designed to satisfy the intents of various developers
  – Each OSS license has different conditions
  – Reuse is also restricted by the licenses

Representative OSS Licenses
• 3-clause BSD License (BSD3)
  – A derivative work must retain the copyright notices, the list of conditions, and the disclaimer of warranties
• Apache License Version 2 (Apachev2)
  – A derivative work must retain copyright, patent, trademark, and attribution notices
• GNU General Public License Version 2 (GPLv2)
  – A derivative work must be distributed under GPLv2
• LicenseName code ≡ source code distributed under LicenseName
  Ex. BSD3 code ≡ source code distributed under BSD3

CnP between Files under Different Licenses
• If a developer reuses source code:
  – Both the license of the reused code and the license of the code under development must be satisfied simultaneously
  – Distribution of the code under development is prohibited if the two licenses cannot both be satisfied
[Figure: CnP between BSD3 code and GPLv2 code, and between Apachev2 code and GPLv2 code]

Impact of License on CnP
• Hypothesis
  – The characteristics of source code reuse depend on the license
    • Frequency of CnP
    • Kinds of licenses used by source code developed through CnP
• To our knowledge, there are no quantitative studies of CnP reuse from the viewpoint of software licenses
• We investigate actual OSS to test this hypothesis

Experiment
• A quantitative experiment was performed on a small data set
• Purpose
  – Confirming our hypothesis
  – Investigating the scalability of our method
• Overview
  – Investigating the number of CnP instances for each license
  – Code clone detection is used for CnP detection
    • A code clone is a code fragment similar to another code fragment
    • Code clones are typically generated by CnP

Method of Experiment
[Figure: source files of Application X and Application Y (License A, License B, Unknown) go through Step 1 (license detection), Step 2 (code clone detection), and Step 3 (counting code clones), producing code fragments grouped by their license]

  License      #Code Fragments
  License A    10
  License B    3
  …            …

Step 1. License Detection
• Ninka [1] is used to detect the licenses of source files
  – It analyzes the license description in each source file
  – It detects licenses with high precision
• Files whose licenses Ninka fails to detect are excluded
  – Files that contain no license description or an unknown license description
[1] D. M. German, Y. Manabe and K. Inoue: “A sentence-matching method for automatic license identification of source code files”, ASE 2010, pp. 437–446 (2010)

Step 2. Code Clone Detection
• CCFinder [2] is used to extract code clones across different applications
  – We assume that CnP within an application will not cause license problems
• Filtering
  – Code clones generated by means other than CnP are excluded
    Ex. getters/setters, variable declarations
• The directions of CnP are not determined
[Figure: Applications X, Y, and Z under Licenses A, B, and C, with clone relations labeled CnP, Variable Declarations, and Getter/Setter]
[2] T. Kamiya, S. Kusumoto and K. Inoue: “CCFinder: A multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28, pp. 654–670 (2002)

Step 3. Counting Code Clones (1/2)
• Repeat the following steps for each target license
  1. Select a license as the analysis target
  2. Extract the clone sets that include code under that license
     • A clone set is a set of code clones similar to each other
  3. Count the code fragments in the extracted clone sets, grouped by their license
[Figure: CnP relations among Application X (License A), Application Y (License B), and Application Z (License C)]

Fragments having CnP relations to License A code

  License      #Code Fragments
  License A    2
  License B    1
  License C    2

Step 3. Counting Code Clones (2/2)
• A clone set includes both the original code fragments and the code fragments generated by CnP
  → Counting the code fragments in clone sets approximates counting the number of CnP instances
• The count covers CnP both to and from code fragments under the target license
• Although the table therefore includes CnP in the opposite direction, it is sufficient for a brief overview
[Figure and table repeated from the previous slide: fragments having CnP relations to License A code]

Analyzed Code
• Java files (.java) in the main section of Debian GNU/Linux 5.0.2
• Reasons for selecting this target
  – It consists of code under various licenses
  – It can be analyzed by both Ninka and CCFinder
  – It is a feasible scale for this experiment

  #Packages    #Files    LOC
  452          77,452    8,530,896

License Distribution in Analyzed Code
[Figure: bar chart of #Files per license; vertical axis from 0 to 20,000]

Result (BSD3)
• Result of counting the code fragments in clone sets that include BSD3 fragments, grouped by their license
• The frequency of the licenses used by code fragments having a CnP relationship to BSD3 fragments

  License                     #Fragments    Percentage
  BSD3                        613           92%
  GPLv2+                      20            3.0%
  Apachev2                    16            2.4%
  LesserGPL2+                 14            2.1%
  GPLv2,ClassPathException    1             0.15%
  LesserGPL2.1+               1             0.15%

• BSD3 code is mostly reused by BSD3 code

Result (Apachev2)

  License           #Fragments    Percentage
  Apachev2          1533          77%
  Apachev1.1        316           16%
  LesserGPL2.1+     42            2.1%
  MPLv1.1           33            1.6%
  BSD3              29            1.5%
  MX4JLicensev1     16            0.80%
  GPLv2+            4             0.20%
  LibraryGPL2+      3             0.15%
  MPLv1.0           2             0.10%
  MITX11noNotice    2             0.10%
  Public Domain     1             0.050%
  Subversion+       1             0.050%
  EPLv1             1             0.050%

• A large percentage of CnP occurs between Apachev2 code fragments
• The license of Apachev1.1 code has been changed to Apachev2

Result (GPLv2+)

  License                              #Fragments    Percentage
  GPLv2+                               268           49%
  GPLnoVersion,GPLv2+,LinkException    225           41%
  BSD3                                 28            5.1%
  LibraryGPLv2+                        20            3.6%
  Apachev2                             4             0.73%
  LesserGPLv2.1+                       4             0.73%

• CnP within GPLv2+ code accounts for the highest percentage
• “GPLnoVersion, GPLv2+, LinkException” also has a high percentage
• “GPLnoVersion, GPLv2+, LinkException” code is reused by GPLv2+ code
[Figure: CnP between “GPLnoVersion, GPLv2+, LinkException” code and GPLv2+ code]

#Files and #Fragments under Each License
• Code under a license is copy-and-pasted frequently if “#Fragments / #Files” of the license is large

  License     #Fragments    #Files    #Fragments / #Files
  BSD3        665           2181      0.305
  Apachev2    1983          16350     0.121
  GPLv2+      549           8160      0.0673

• The frequency of CnP per file: BSD3 > Apachev2 > GPLv2+

Summary of the Results
• Common characteristic of all licenses
  – CnP within code distributed under the same license, or under licenses designed by the same organization, accounts for the majority
    • CnP might happen mostly within an organization
• Apachev2 has CnP relations to various licenses
  – Apachev2 has the largest number of files
  – The conditions of Apachev2 are more relaxed than those of GPLv2+
• The frequency of CnP per file: BSD3 > Apachev2 > GPLv2+

Threats to Validity
• The results are insufficient to generalize to OSS as a whole
  – The analysis target is small → We plan a large-scale analysis
  – Only Java files were analyzed
    • The history of Java is short, hence Java files are less copy-and-pasted than others → We plan an analysis of C/C++ files
• Overlapping code fragments may be counted separately
  – The number of overlapping code fragments might be small
[Figure: overlapping code fragments, Fragment A and Fragment B]

Scalability of the Investigation Method
• This method can be applied to a large target, because each step scales
  – License detection
    • Ninka can analyze files in linear time
  – Code clone detection
    • There are more scalable tools than CCFinder, such as CCFinderX and D-CCFinder
  – Counting code clones
    • This process did not take a long time

Conclusion
• A preliminary study of the impact of licenses on CnP was performed
  – Java files in the main section of Debian GNU/Linux 5.0.2 were analyzed
• CnP happens mostly within code distributed under the same license or under licenses designed by the same organization
• The frequency of CnP per file
  – BSD3 > Apachev2 > GPLv2+
• Our method can be applied to a large target

Future Work
• Large-scale experiment
• Investigating whether code fragments are copy-and-pasted mostly within an organization
• Detecting the direction of CnP
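
Sketch: counting code fragments per license (Step 3)
The counting step and the "#Fragments / #Files" ratio described in the slides can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the tooling used in the study: the clone sets are made-up toy data standing in for the output of license detection and clone detection, and the names CLONE_SETS, count_fragments_by_license, and fragments_per_file are hypothetical. No Ninka or CCFinder interfaces are called.

```python
from collections import Counter

# Toy input (made up for illustration): each clone set is a list of
# fragments, and each fragment is an (application, license) pair.
CLONE_SETS = [
    [("Application X", "License A"), ("Application Y", "License B")],
    [("Application X", "License A"), ("Application Z", "License C"),
     ("Application Z", "License C")],
    [("Application Y", "License B"), ("Application Z", "License C")],
]


def count_fragments_by_license(clone_sets, target_license):
    """Count fragments per license in the clone sets that contain at
    least one fragment under target_license."""
    counts = Counter()
    for clone_set in clone_sets:
        licenses = [lic for _app, lic in clone_set]
        if target_license in licenses:
            counts.update(licenses)
    return counts


def fragments_per_file(n_fragments, n_files):
    """Ratio used in the slides to compare CnP frequency per file."""
    return n_fragments / n_files


if __name__ == "__main__":
    counts = count_fragments_by_license(CLONE_SETS, "License A")
    for lic, n in sorted(counts.items()):
        print(f"{lic}: {n}")

    # Figures taken from the "#Files and #Fragments" slide.
    reported = [("BSD3", 665, 2181), ("Apachev2", 1983, 16350),
                ("GPLv2+", 549, 8160)]
    for name, frags, files in reported:
        print(f"{name}: {fragments_per_file(frags, files):.3g}")
```

With the toy clone sets, the third set is skipped because it contains no License A fragment, so the counts match the example table (License A 2, License B 1, License C 2). Because the direction of CnP is not determined, the count covers fragments on both sides of each copy, as noted in the slides.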