An Empirical Study of the Relationship Between Code Bad Smells and Software Faults
Min Zhang
School of Computer Science
University of Hertfordshire

Introduction
What is a Code Bad Smell?
Problems using Code Bad Smells
An overview of the empirical study
Code Bad Smell detection
Fault identification
Results and discussion
Conclusion
Q/A

Code Bad Smells
  - The 22 Code Bad Smells are poor structures in source code, identified informally by Fowler et al. (1999).
  - Fowler et al. (1999) suggest that Code Bad Smells can give “indications that there is trouble that can be solved by a refactoring”.
  - They are widely used to detect refactoring opportunities in software (Mens and Tourwe, 2004).

Problems in Using Code Bad Smells
  - Fowler et al. (1999) claim that Code Bad Smells are structures that have detrimental effects on software. However, little empirical evidence has been provided to support this claim.
  - Most existing Code Bad Smell detection tools are metric-based, and we question their accuracy.

An Empirical Study of the Relationship between Code Bad Smells and Faults
  Objective: Capture the relationship between Code Bad Smells and faults.
  Targeted Code Bad Smells: Data Clumps, Message Chains, Middle Man, Speculative Generality, and Switch Statements.
  Research data:
  - Eclipse core packages (releases 3.0, 3.0.1, 3.0.2, 3.1 and 3.2)
  - Apache Commons packages (Commons IO, Commons Logging, Commons Codec, Commons DbUtils, Commons DBCP, and Commons Net)

Code Bad Smell Detection
  - Pattern-based Code Bad Smell detection
  - Define each Code Bad Smell as a set of particular code patterns
  - The idea comes from Gamma et al.’s (1995) definition of the GoF Design Patterns
  - Use the Recoder API to analyse Java source code

An Example: The Pattern-based Definition of the Message Chains Bad Smell
  Fowler et al.’s definition:
    “You see message chains when a client asks one object for another object, which the client then asks for yet another object, which the client then asks for yet another another object, and so on. You may see these as a long line of getThis methods, or as a sequence of temps.” (Fowler et al., 1999)
  Pattern-based definition:
    An instance of the Message Chains Bad Smell exists in either of the following situations.
    Situation 1:
      1. To access a data field in another class, a statement calls more than a threshold number of getter methods in sequence (e.g. int a=b.getC().getD();).
      2. The method call statement and the declarations of the getter methods are in different classes.
    Situation 2:
      1. To access a data field in another class, the source code uses more than a threshold number of temp variables.
      2. A temp variable is a variable that only accesses data members (data fields or getter methods) of other classes or of other temp variables (e.g. ClassC tmpC=b.getC(); int a=tmpC.getD();).

Fault Identification
  Zimmermann et al.’s (2007) fault identification approach:
    1. Locate the tokens “bug”, “fix(ed)” and “update(d)” in CVS comment messages.
    2. If a version entry in CVS contains one or more of these tokens and the tokens are followed by numbers, the version entry is treated as a bug-fixing update.
    3. Those numbers are treated as bug IDs.
    4. Confirm each bug ID against the Bugzilla database.
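The four fault identification steps above can be sketched roughly as follows. This is a minimal illustration, not the tooling used in the study: the function names `candidate_bug_ids` and `bug_fix_ids` are hypothetical, the rule that numbers must follow a token directly (allowing only spaces, ":", "#", "," or "and" in between) is an assumption, and the Bugzilla lookup is replaced by a plain set of known bug IDs.

```python
import re

# Tokens from the slide: "bug", "fix(ed)", "update(d)".
# Assumption: the numbers must follow the token directly, separated only
# by whitespace, ":" or "#"; further IDs may be joined by "," or "and".
TOKEN_RE = re.compile(
    r"\b(?:bug|fix(?:ed)?|updated?)\b[\s:#]*(\d+(?:\s*(?:,|and)\s*#?\d+)*)",
    re.IGNORECASE,
)


def candidate_bug_ids(comment):
    """Steps 1-3: return the numbers that follow a bug-fix token in a comment."""
    ids = []
    for match in TOKEN_RE.finditer(comment):
        ids.extend(int(n) for n in re.findall(r"\d+", match.group(1)))
    return ids


def bug_fix_ids(comment, known_bug_ids):
    """Step 4: keep only candidates found in known_bug_ids.

    known_bug_ids is a hypothetical stand-in for a Bugzilla database lookup.
    """
    return [i for i in candidate_bug_ids(comment) if i in known_bug_ids]


if __name__ == "__main__":
    known = {12345}
    print(bug_fix_ids("Fixed bug 12345 and 999", known))
```

A comment with no qualifying token, or with a token that is not followed by a number, yields an empty list and would not be counted as a bug-fixing update under these assumptions.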
Results and Discussion: Binary Coding of the Existence of Code Bad Smells (1)
  Existence of Code Bad Smells (1 = present, 0 = absent)
  Data Clumps  Message Chains  Speculative Generality  Middle Man  Switch Statements  Coding
  0            0               0                       0           0                  0
  1            0               0                       0           0                  1
  0            1               0                       0           0                  2
  1            1               0                       0           0                  3
  0            0               1                       0           0                  4
  1            0               1                       0           0                  5
  0            1               1                       0           0                  6
  1            1               1                       0           0                  7
  0            0               0                       1           0                  8
  1            0               0                       1           0                  9
  0            1               0                       1           0                  10
  1            1               0                       1           0                  11
  0            0               1                       1           0                  12
  1            0               1                       1           0                  13
  0            1               1                       1           0                  14
  1            1               1                       1           0                  15

Results and Discussion: Binary Coding of the Existence of Code Bad Smells (2)
  Existence of Code Bad Smells (1 = present, 0 = absent)
  Data Clumps  Message Chains  Speculative Generality  Middle Man  Switch Statements  Coding
  0            0               0                       0           1                  16
  1            0               0                       0           1                  17
  0            1               0                       0           1                  18
  1            1               0                       0           1                  19
  0            0               1                       0           1                  20
  1            0               1                       0           1                  21
  0            1               1                       0           1                  22
  1            1               1                       0           1                  23
  0            0               0                       1           1                  24
  1            0               0                       1           1                  25
  0            1               0                       1           1                  26
  1            1               0                       1           1                  27
  0            0               1                       1           1                  28
  1            0               1                       1           1                  29
  0            1               1                       1           1                  30
  1            1               1                       1           1                  31

Results and Discussion: One-way Analysis of Variance, Eclipse Data
  - The five profiles in which exactly one of the five Code Bad Smells is present (codings 1, 2, 4, 8 and 16) have a significantly lower mean number of faults than profile zero.
  - All profiles that have a higher mean number of faults than profile zero contain both the Message Chains and the Switch Statements Bad Smells.

Results and Discussion: Message Chains and Switch Statements
  - All source code samples associated with more than 10 faults contain the Message Chains Bad Smell.
  - The Switch Statements Bad Smell does not show a clear relationship with a high number of faults.

Results and Discussion: One-way Analysis of Variance, Apache Data
  - The five profiles in which exactly one of the five Code Bad Smells is present have a lower mean number of faults than profile zero.
  - The profiles that contain the Message Chains Bad Smell do not show a higher mean number of faults than profile zero.
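The profile codings in the two binary coding tables above follow a simple bit pattern: each smell contributes a power of two (Data Clumps 1, Message Chains 2, Speculative Generality 4, Middle Man 8, Switch Statements 16). A minimal sketch of that encoding is given below; the names `SMELLS`, `profile_code` and `profile_smells` are illustrative and do not come from the study.

```python
# Smells in the column order of the coding tables; position i has weight 2**i.
SMELLS = (
    "Data Clumps",
    "Message Chains",
    "Speculative Generality",
    "Middle Man",
    "Switch Statements",
)


def profile_code(present):
    """Encode the smells present in a source code sample as a profile number 0..31."""
    present = set(present)
    unknown = present - set(SMELLS)
    if unknown:
        raise ValueError(f"unknown smell(s): {sorted(unknown)}")
    return sum(1 << i for i, smell in enumerate(SMELLS) if smell in present)


def profile_smells(code):
    """Decode a profile number back into the list of smells it contains."""
    if not 0 <= code < (1 << len(SMELLS)):
        raise ValueError("profile code must be between 0 and 31")
    return [smell for i, smell in enumerate(SMELLS) if (code >> i) & 1]


if __name__ == "__main__":
    print(profile_code({"Message Chains", "Switch Statements"}))
    print(profile_smells(30))
```

For example, a sample containing only Message Chains and Switch Statements maps to 2 + 16 = 18, which matches the corresponding row of the second table. Profile zero is the smell-free baseline against which the other profiles are compared.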
A Detailed Investigation of Message Chains
  Objective:
  - To test whether the Message Chains Bad Smell is directly associated with faults.
  - To test whether the Message Chains Bad Smell is directly associated with particular types of faults.
  Method:
  - Manually investigate 20 source code samples from the Eclipse project.

A Detailed Investigation of Message Chains: Direct Association with Faults
  Association Type                          Detail of Change          Number of Instances
  Message Chains Touched During Fix         Message Chains Increased  4
  Message Chains Touched During Fix         Message Chains Reduced    5
  Message Chains Not Touched During Fix                               45
  Total                                                               54

A Detailed Investigation of Message Chains: Fault Classification
  Classification schema: an adapted version of Seaman et al.’s (2008) fault classification schema.
  Results:
  Type of Fault            Number of Instances
  Algorithm / Method       4
  Checking                 1
  External Interface       2
  Internal Interface       2
  Non-functional Defects   0
  Other                    0

A Detailed Investigation of Message Chains: Result
  - The Message Chains Bad Smell is not likely to be directly associated with faults, but it indicates a complicated software context.
  - The Message Chains Bad Smell is likely to be associated with Algorithm/Method faults.

Conclusion
  - Source code containing only one of the five Code Bad Smells is not likely to be fault prone.
  - The Message Chains Bad Smell could cause a high number of faults and is likely to be associated with Algorithm/Method faults, so it deserves further attention.
  - The Message Chains Bad Smell may not be directly associated with faults, but it may indicate a complicated software context.

Q/A

References
  FOWLER, M., BECK, K., BRANT, J., OPDYKE, W. & ROBERTS, D. (1999) Refactoring: Improving the Design of Existing Code, Addison Wesley.
  GAMMA, E., HELM, R., JOHNSON, R. & VLISSIDES, J. (1995) Design Patterns: Elements of Reusable Object-Oriented Software, Reading, Mass., Addison-Wesley.
  MENS, T. & TOURWE, T. (2004) A survey of software refactoring. Software Engineering, IEEE Transactions on, 30, 126-139.
  SEAMAN, C. B., SHULL, F., REGARDIE, M., ELBERT, D., FELDMANN, R. L., GUO, Y. & GODFREY, S. (2008) Defect categorization: making use of a decade of widely varying historical data. Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. Kaiserslautern, Germany, ACM.
  ZIMMERMANN, T., PREMRAJ, R. & ZELLER, A. (2007) Predicting Defects for Eclipse. In PREMRAJ, R. (Ed.) Predictor Models in Software Engineering, 2007. PROMISE'07: ICSE Workshops 2007. International Workshop on.