Test vs. inspection
Part 2
Tor Stålhane

Testing and inspection
A short data analysis

Test and inspections – some terms
First we need to understand two important terms: defect types and triggers.
After this we will look at inspection data and test data from three activity types, organized according to type of defect and trigger.
We need the defect categories to compare testing and inspections: which is best at finding what?

Defect categories
This presentation uses eight defect categories:
• Wrong or missing assignment
• Wrong or missing data validation
• Error in algorithm – no design change is necessary
• Wrong timing or sequencing
• Interface problems
• Functional error – design change is needed
• Build, package or merge problem
• Documentation problem

Triggers – 1
It is difficult to focus on several problem areas at the same time. It is practical to take one problem area, an area of concern, at a time.
We could for instance select “missing exceptions” as a trigger. In this case we would go through all the code, look for places where an exception should be, and check that it has been inserted.

Triggers – 2
In general we will use the term “trigger” for:
• A goal to be achieved – e.g. understanding something
• Something to be checked – e.g. conformance to a standard
• Something to look for – e.g. side effects

Triggers – 3
We will use different triggers for test and inspections. In addition, white box and black box tests will use different triggers.
There is no agreed definition of which terms should be used for triggers. The triggers suggested in the following slides are examples of triggers that have worked well for others.
Inspection triggers
• Design conformance
• Understanding details
  – Operation and semantics
  – Side effects
  – Concurrency
• Backward compatibility – earlier versions of this system
• Lateral compatibility – other, similar systems
• Rare situations
• Document consistency and completeness
• Language dependencies

Test triggers – black box
• Test coverage
• Sequencing – two code chunks in sequence
• Interaction – two code chunks in parallel
• Data variation – variations over a simple test case
• Side effects – unanticipated effects of a simple test case

Test triggers – white box
• Simple path coverage
• Combinational path coverage – same path covered several times but with different inputs
• Side effects – unanticipated effects of a simple path coverage

Testing and inspection – the V model

Inspection data
We will look at inspection data from three development activities:
• High level design: architectural design
• Low level design: design of subsystems, components (modules) and data models
• Implementation: realization, writing code
This is the left hand side of the V-model.

Test data
We will look at test data from three development activities:
• Unit testing: testing a small unit like a method or a class
• Function verification testing: functional testing of a component, a system or a subsystem
• System verification testing: testing the total system, including hardware and users.
This is the right hand side of the V-model.

What did we find
The next tables will, for each of the assigned development activities, show the following information:
• Development activity
• The three most efficient triggers
First for inspection and then for testing.

Inspection – defect types
Activity             Defect type      Percentage
High level design    Documentation    45.10
                     Function         24.71
                     Interface        14.12
Low level design     Algorithm        20.72
                     Function         21.17
                     Documentation    20.27
Code inspection      Algorithm        21.62
                     Documentation    17.42
                     Function         15.92

Inspection – triggers
Activity             Trigger                 Percentage
High level design    Understand details      34.51
                     Document consistency    20.78
                     Backward compatible     19.61
Low level design     Side effects            29.73
                     Operation semantics     28.38
                     Backward compatible     12.16
Code inspection      Operation semantics     55.86
                     Document consistency    12.01
                     Design conformance      11.41

Testing – triggers and defects
Activity                 Trigger            Percentage
Implementation testing   Test sequencing    41.90
                         Test coverage      33.20
                         Side effects       11.07

Activity                 Defect type                Percentage
Implementation testing   Interface                  39.13
                         Assignments                17.79
                         Build / Package / Merge    14.62

Some observations – 1
• Pareto’s rule will apply in most cases – both for defect types and triggers
• Defects related to documentation and functions taken together are the most commonly found defect types in inspection
  – HLD: 69.81%
  – LLD: 41.44%
  – Code: 33.34%

Some observations – 2
• The only defect type that is among the top three both for testing and inspection is “Interface”
  – Inspection, HLD: 14.12%
  – Testing: 39.13%
• The only trigger that is among the top three both for testing and inspection is “Side effects”
  – Inspection, LLD: 29.73%
  – Testing: 11.07%

Summary
Testing and inspection are different activities.
By and large, they
• Need different triggers
• Use different mind sets
• Find different types of defects
Thus, we need both activities in order to get a high quality product.

Inspection as a social process
Inspection is a people-intensive process. Thus, we cannot consider only technical details. We also need to consider how people
• Interact
• Cooperate

Data sources
We will base our discussion on data from two sources:
• UNSW – three experiments with 200 students. Focus was on process gain versus process loss.
• NTNU – two experiments
  – NTNU 1 with 20 students. Group size and the use of checklists.
  – NTNU 2 with 40 students. Detection probabilities for different defect types.

The UNSW data
The programs inspected were
• 150 lines long with 19 seeded defects
• 350 lines long with 38 seeded defects
1. Each student inspected the code individually and turned in an inspection report.
2. The students were randomly assigned to one out of 40 groups, three persons per group.
3. Each group inspected the code together and turned in a group inspection report.

Gain and loss – 1
In order to discuss process gain and process loss, we need two terms:
• Nominal group (NG) – a group of persons that will later participate in a real group but are currently working alone.
• Real group (RG) – a group of people in direct communication, working together.

Gain and loss – 2
The next diagram shows the distribution of the difference NG – RG. Note that the
• Process loss can be as large as 7 defects
• Process gain can be as large as 5 defects
Thus, there are large opportunities and large dangers.

Gain and loss – 3
[Figure: distribution of the difference NG – RG, from 7 down to -6, for Exp 1, Exp 2 and Exp 3]

Gain and loss – 4
If we pool the data from all experiments, we find that the probability for:
• Process loss is 53%
• Process gain is 30%
Thus, if we must choose, it is better to drop the group part of the inspection process.
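
The NG – RG bookkeeping behind the gain and loss slides can be sketched as follows. This is only a minimal sketch under assumptions that the slides do not state: the nominal group's result is taken to be the set of distinct defects its members reported while working alone, and the function name `process_outcome` and the defect identifiers are made up for illustration.

```python
def process_outcome(individual_reports, group_report):
    """Classify one group as process loss, process gain, or no change.

    Assumption: the nominal group (NG) score is the number of distinct
    defects found by the members working alone; the real group (RG)
    score is the number of defects in the group report.
    """
    ng = set().union(*individual_reports)
    rg = set(group_report)
    difference = len(ng) - len(rg)  # NG - RG, the quantity plotted on the slides
    if difference > 0:
        label = "process loss"   # the group reported fewer defects than NG
    elif difference < 0:
        label = "process gain"   # the group reported more defects than NG
    else:
        label = "no change"
    return difference, label


# Hypothetical defect identifiers, not data from the experiments.
alone = [{"D1", "D2"}, {"D2", "D3"}, {"D4"}]
together = {"D1", "D2", "D5"}
print(process_outcome(alone, together))
```

With this sign convention a positive difference means the group meeting lost defects, which matches the slides' statement that process loss can be as large as 7 defects.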
Reporting probability – 1
[Figure: probability of reporting a defect (0.00 to 1.00) for RG 1, RG 2 and RG 3, in the categories NG = 0, NG = 1, NG = 2 and NG > 2]

Reporting probability – 2
There is a 10% probability of reporting a defect even if nobody found it during their preparations (group effect).
There is an 80% to 95% probability of reporting a defect that is found by everybody in the nominal group during preparations.

Reporting probability – 3
The table and diagram open up for two possible interpretations:
• We have a, possibly silent, voting process. The majority decides what is reported from the group and what is not.
• The defect reporting process is controlled by group pressure. If nobody else has found it, it is hard for a single person to get it included in the final report.

A closer look – 1
The next diagram shows that when we have
• Process loss, we find few new defects during the meeting but remove many
• Process gain, we find many new defects during the meeting but remove just a few
• Process stability, we find and remove roughly the same number during the meeting.

New, retained and removed defects
[Figure: number of new, retained and removed defects (0 to 50) for groups with RG > NG, RG = NG and RG < NG]

A closer look – 2
It seems that groups can be split according to the following characteristics
• Process gain
  – All individual contributions are accepted.
  – Find many new defects.
• Process loss
  – Minority contributions are ignored.
  – Find few new defects.

A closer look – 3
A group with process loss is doubly negative. It rejects minority opinions and thus most defects found by just a few of the participants during:
• Individual preparation.
• The group meeting.
The participants can be good at finding defects. The problem is the group process.

The NTNU-1 data
We had 20 students in the experiment. The program to inspect was 130 lines long. We seeded 13 defects in the program.
1. We used groups of two, three and five students.
2. Half the groups used a tailored checklist.
3. Each group inspected the code and turned in an inspection report.

Group size and check lists – 1
We studied two effects:
• The size of the inspection team. Small groups (2 persons) versus large groups (5 persons)
• The use of checklists or not
In addition we considered the combined effect, the factor interaction.

DoE-table
Group size (A)   Use of checklists (B)   A×B   Number of defects reported
-                -                       +     7
-                +                       -     9
+                -                       -     13
+                +                       +     11

Group size and check lists – 2
Simple arithmetic gives us the following results:
• Group size effect – small vs. large – is 4.
• Check list effect – use vs. no use – is 0.
• Interaction – large groups with check lists vs. small group without – is -2.
Standard deviation is 1.7. Two standard deviations (3.4), corresponding to the 5% significance level, rule out everything but group size.

The NTNU-2 data
We had 40 students in the experiment. The program to inspect was 130 lines long. We seeded 12 defects in the program.
1. We had 20 PhD students and 20 third year software engineering students.
2. Each student inspected the code individually and turned in an inspection report.

Defect types
The 12 seeded defects were of one of the following types:
• Wrong code – e.g. wrong parameter
• Extra code – e.g. unused variable
• Missing code – e.g. no exception handling
There were four defects of each type.

How often is each defect found
[Figure: how often each defect is found (scale 0.00 to 0.90), low experience vs. high experience, for defects D3, D4, D8, D10, D2, D5, D9, D12, D1, D6, D7 and D11]

Who finds what – and why
First and foremost we need to clarify what we mean by high and low experience.
• High experience – PhD students.
• Low experience – third and fourth year students in software engineering.
High experience, in our case, turned out to mean less recent hands-on development experience.

Hands-on experience
The plot shows us that:
• People with recent hands-on experience are better at finding missing code
• People with more engineering education are better at finding extra (unnecessary) code.
• Experience does not matter when finding wrong code statements.
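
The “simple arithmetic” behind the DoE-table on the group size and check list slides can be sketched as follows. This is a sketch, not necessarily how the original analysis was done: it assumes the usual two-level factorial estimate (mean response at the + level minus mean response at the - level), and the names `RUNS`, `effect` and `STD_DEV` are introduced here for illustration. The four responses and the standard deviation of 1.7 are taken from the slides.

```python
# Rows of the DoE-table: (A = group size, B = use of checklists, defects reported).
# -1 encodes the low level (small group / no checklist), +1 the high level.
RUNS = [
    (-1, -1, 7),
    (-1, +1, 9),
    (+1, -1, 13),
    (+1, +1, 11),
]


def effect(signs, responses):
    """Mean response at the + level minus mean response at the - level."""
    plus = [y for s, y in zip(signs, responses) if s > 0]
    minus = [y for s, y in zip(signs, responses) if s < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)


responses = [y for _, _, y in RUNS]
a_signs = [a for a, _, _ in RUNS]
b_signs = [b for _, b, _ in RUNS]
ab_signs = [a * b for a, b, _ in RUNS]  # interaction column A x B

effects = {
    "group size (A)": effect(a_signs, responses),
    "checklist (B)": effect(b_signs, responses),
    "interaction (AxB)": effect(ab_signs, responses),
}

STD_DEV = 1.7  # standard deviation stated on the slides
# An effect counts as significant if it exceeds two standard deviations.
significant = {name: abs(e) > 2 * STD_DEV for name, e in effects.items()}

for name, e in effects.items():
    print(f"{name}: effect = {e:+.1f}, significant = {significant[name]}")
```

Under these assumptions the sketch reproduces the effects 4, 0 and -2 from the slides, and only the group size effect exceeds the two standard deviation threshold of 3.4.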