Randomness and Testing
Analysis, Lecture 12
Claire Le Goues
February 24, 2015
(c) 2015 C. Le Goues

Learning goals
• Define random testing, describe its benefits and drawbacks, contrast it with partition testing, and name an example tool.
• Describe fuzz testing, including its technical distinction from regular random testing, and the defects it is particularly well suited to find.
• Explain how randomness can help with testing for robustness, usability, integration, and performance.
• Define mutation testing and explain why it is useful.

Switch statements and V(G)
• …a more complicated question than you'd think.
• Short answer: the number of cases.
• Long answer: counting switch statements in terms of the actual number of edges can lead to misleadingly high complexity numbers.
• McCabe suggested simply ignoring them.

Random testing
• Testing (verb): the more or less thorough execution of the software with the purpose of finding bugs before the software is released for use, and to establish that the software performs as expected.
• Random (adjective): 1. proceeding, made, or occurring without definite aim, reason, or pattern: the random selection of numbers. 2. Statistics. of or characterizing a process of selection in which each item of a set has an equal probability of being chosen. (Dictionary.com, 2014)

In a nutshell
• Select inputs independently at random from the program's input domain:
– Identify the input domain of the program.
– Map random numbers to that input domain.
– Select inputs from the input domain according to some probability distribution.
– Determine whether the program produces the appropriate outputs on those inputs.
• Random testing can provide probabilistic guarantees about the likely faultiness of the program.
– E.g., running ~23,000 random tests without failure (N = 23,000) establishes, with a confidence of 90% (C = 0.9), that the program will not fail more than one time in 10,000 runs (F = 10^4).

Random testing
• "The systematic variation of values through the input space with the purpose of identifying abnormal output patterns. When such patterns are identified a root cause analysis is conducted to identify the source of the problem."
[Figure: outputs plotted across the input space; in this case, the "state 3" outputs seem to be missing.]
When Only Random Testing Will Do, Dick Hamlet, 2006

Why use random testing?
• It's cheap, assuming you solve the oracle problem.
– You may need more tests to find the same number of faults, but generating tests is very easy.
• You can calculate the reliability of an application using established probability theory (next slide).
– This can augment release criteria, e.g., "no random failures for 3 weeks prior to a release under a given random testing protocol."
• Good when:
– Lack of domain knowledge makes it difficult or meaningless to partition the input space into equivalence classes.
– The amount of state information is important.
– Large volumes of data are necessary, such as for load testing, stress testing, robustness testing, or reliability calculations.
• A useful complement to partition testing.

When to use random testing
• Lack of domain knowledge makes it difficult or meaningless to partition input into equivalence classes.
• The amount of state information is important.
• When large volumes of data are necessary:
– Load testing
– Stress testing
– Robustness testing
– Reliability calculations
• To complement partition testing. (A minimal code sketch follows.)
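To make the recipe concrete, here is a minimal sketch of a random tester in Python. The system under test (isqrt) and the domain bound are hypothetical stand-ins, and the oracle is a property check rather than per-input expected outputs; this is an illustration of the slides' procedure, not any particular tool.

    import random

    # Hypothetical system under test: an integer square root.
    def isqrt(n):
        return int(n ** 0.5)

    def random_test(trials=23_000, seed=None):
        rng = random.Random(seed)
        failures = []
        for _ in range(trials):
            # Map a random number onto the input domain (here 0..10**9).
            n = rng.randrange(0, 10 ** 9)
            r = isqrt(n)
            # Property-based oracle: r is the integer square root of n
            # exactly when r*r <= n < (r+1)*(r+1).
            if not (r * r <= n < (r + 1) * (r + 1)):
                failures.append(n)
        return failures

    if __name__ == "__main__":
        bad = random_test(seed=42)
        print(len(bad), "failing inputs found")

If all 23,000 runs pass, the reliability argument on the next slide applies: with 90% confidence, the failure rate is below one in 10,000.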
Mathematical reliability
• Assume program P has a constant failure rate of q.
– The probability that P will fail a given test is q; that it will succeed, 1 - q.
• On N independent tests:
– probability of universal success: (1 - q)^N
– probability of at least one failure: ε = 1 - (1 - q)^N
• 1 - ε is the confidence that a failure will occur no more often than once in 1/q runs.
• Solve for 1/q, which is also the mean time to failure (MTTF):
    1/q ≥ 1 / (1 - (1 - ε)^(1/N))
• The number of tests required to attain confidence 1 - ε in this MTTF:
    N = log(1 - ε) / log(1 - q)

Challenges
• Selecting or sampling from the input space:
– Uniform distribution: uniform random selection.
– Equispaced: unsampled gaps are the same size.
– Operational profile (only makes sense at the system level).
– Proportional sampling: sample according to the subdomain distribution (partition the input space).
– Adaptive sampling: take the pattern of previously identified failure-causing inputs into account in the sampling strategy.
• The oracle problem: a test case is an input, an expected output, and a mechanism for determining whether the observed output is consistent with the expected output.
• Root cause analysis: how do you debug using a randomly generated input? E.g., consider localizing a failure triggered by randomly generated code like the following:

    if ((((l_421 || (safe_lshift_func_uint8_t_u_u(l_421, 0xABE574F6L))) &&
        (func_77(func_38((l_424 >= l_425), g_394, g_30.f0), func_8(l_408, g_345[2],
        g_7, (*g_165), l_421), (*l_400), func_8(((*g_349) != (*g_349)), (l_426 !=
        (*l_400)), (safe_lshift_func_int16_t_s_u((**g_349), 0xD5C55EF8L)), 0x0B1F0B62L,
        g_95), (safe_add_func_uint32_t_u_u((*g_165), l_431)))) ^
        ((safe_rshift_func_uint8_t_u_s(((*g_165) >= (**g_349)), (safe_mul_func_int8_t_s_s
        ((*g_165), l_421)))) <= func_77((*g_129), g_95, 1L, l_408, (*l_400))))) {
      struct S0 *l_443 = &g_30;
      (*l_400) = ((safe_mod_func_int16_t_s_s((safe_add_func_int16_t_s_s(l_421,
          (**g_164))), (**g_349))) && l_425);
      l_447 ^= (safe_sub_func_int16_t_s_s(0x27AC345CL, ((**g_250) <=
          func_66(l_446, g_19, g_129, (*g_129), l_407))));
      (*l_446) = func_22(l_431, -1L, l_421, (0x1B625347L <= func_22(g_394, l_447, -1L)));
    } else {
      const uint32_t l_459 = 0x9671310DL;
      l_448 = (*g_186);
      (*l_400) = (0L & (0 == (*g_348)));
      (*l_400) = func_77((*g_31), ((*g_165) && 6L), l_426, func_77((*l_441),
          (safe_lshift_func_uint16_t_u_u((((safe_mul_func_int16_t_s_s((**g_349),
          (*g_165))) | ((*g_165) > l_426)) < (0 != (*g_129))), (&l_431 == &l_408))),
          (l_453 == &l_407), func_77(func_38((*l_400), (safe_mod_func_uint16_t_u_u
          ((l_420 < (*g_165)), func_77((*l_441), l_456, (*l_446), (*l_448), g_345[5]))),
          g_345[4]), g_287, (func_77((*g_129), l_421, (l_424 & (**g_349)), ((*l_453) !=
          (*g_129)), 0x6D4CA97DL) == (safe_div_func_int64_t_s_s(-1L, func_77((*g_129),
          l_459, l_447, (*l_446), l_459)))), g_95, g_19), l_420), (*l_446));
    }

Solutions to the oracle problem
[Figure: four oracle patterns, each driven by an input generator feeding parameters to the SUT. (1) Golden standard: a comparator checks the SUT's output against a reference implementation (pass/fail). (2) Observer: an observer classifies each run as normal, crash, or exception. (3) Assertions: assertions embedded in the SUT flag failing runs. (4) Parametric oracle: a comparator checks the output against expectations computed from the input parameters.]
Parametric oracle example: A Technique for Testing Command and Control Software, M. Watkins, Boeing, 1982
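A minimal sketch of the golden-standard pattern from the figure, assuming a hypothetical hand-rolled sort as the SUT and Python's built-in sorted() as the trusted reference; the observer role is played by the try/except around the SUT call.

    import random

    def sut_sort(xs):
        # Hypothetical SUT: a naive insertion sort we want to test.
        out = []
        for x in xs:
            i = 0
            while i < len(out) and out[i] < x:
                i += 1
            out.insert(i, x)
        return out

    def differential_test(trials=10_000, seed=0):
        rng = random.Random(seed)
        for _ in range(trials):
            xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
            try:
                got = sut_sort(list(xs))
            except Exception as exc:       # observer: crashes/exceptions fail
                return ("crash", xs, exc)
            if got != sorted(xs):          # comparator vs. golden standard
                return ("fail", xs, got)
        return ("pass", None, None)

    print(differential_test())

The golden standard neatly sidesteps the oracle problem, but only when a trusted (usually slower or older) reference implementation exists.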
Criticisms of random testing
• The oracle problem.
• Corner faults might escape detection.
• Prediction problems:
– Predictions based on a uniform distribution can easily be wrong.
– Operational profiles are difficult to obtain, and what suits one user might not suit another, e.g., "novice" vs. "expert".
– Domain-based analysis does not account for program size, only for the number of test points.

The random vs. partition testing debate

Metrics
• P-measure: probability of finding at least one failing test case.
– Random: P_r = 1 - (1 - q)^N
– Partition: P_p = 1 - ∏_{i=1..k} (1 - q_i)^{n_i}
– where q_i is the probability that an input from partition i will cause a failure, n_i is the number of test cases drawn from partition i, and k is the number of partitions.
• E-measure: expected number of triggered failures:
– E_r = N·q;  E_p = ∑_{i=1..k} n_i·q_i
• F-measure: expected number of test cases required to trigger an error.

Mathematical implications
• Partition testing is guaranteed to perform at least as well as random testing if the number of test cases is proportional to the size of each subdomain.
– The ratios of subdomain sizes may be too large, making the required test case counts infeasible.
• The E-measure is the same when the failure rates of all partitions are the same; the partition strategy is better if we focus test cases on the buggy partitions, and the random strategy is better otherwise.
• The P-measure is better for partition testing if the partitions are all the same size and we choose the same number of test cases from each partition.

The discussion: Part 1
• 1984, Duran & Ntafos. "Simulation results are presented which suggest that random testing may often be more cost effective than partition testing schemes"
• 1991, Weyuker & Jeng. "We have shown analytically that partition testing can be an excellent testing strategy or a poor one" … "For a partition testing strategy to be really effective and therefore worth doing…it is necessary that some subdomains be relatively small and contain only failure causing inputs, or at least nearly so"
• 1994, Chen & Yu. "Partition testing is guaranteed to perform at least as well as random testing so long as the number of test cases selected is in proportion to the size of the subdomains"
• 1996, Reid. "As expected, an implementation of BVA was found to be most effective, with neither EP nor random testing half as effective. The random testing results were surprising, requiring just 8 test cases per module to equal the effectiveness of EP, although somewhere in the region of 50,000 random test cases were required to equal the effectiveness of BVA."
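Before Part 2 of the debate, a small sketch that evaluates the P- and E-measures from the metrics slide. The failure rates and sample sizes are made-up numbers, chosen to illustrate Weyuker & Jeng's point about small, failure-dense subdomains.

    def p_random(q, N):
        return 1 - (1 - q) ** N            # P_r = 1 - (1-q)^N

    def p_partition(qs, ns):
        prod = 1.0
        for qi, ni in zip(qs, ns):
            prod *= (1 - qi) ** ni
        return 1 - prod                    # P_p = 1 - prod_i (1-q_i)^{n_i}

    def e_random(q, N):
        return N * q                       # E_r = N*q

    def e_partition(qs, ns):
        return sum(ni * qi for qi, ni in zip(qs, ns))   # E_p = sum_i n_i*q_i

    # One tiny, failure-dense subdomain among mostly healthy ones.
    qs = [0.5, 0.001, 0.001]   # subdomain failure rates (illustrative)
    ns = [1, 1, 1]             # one test drawn per subdomain
    q, N = 0.002, 3            # overall failure rate and budget (illustrative)
    print(p_partition(qs, ns), p_random(q, N))   # ~0.50 vs. ~0.006
    print(e_partition(qs, ns), e_random(q, N))   # ~0.502 vs. ~0.006

With these numbers, partition testing wins decisively because one subdomain concentrates the failure-causing inputs; with equal failure rates across subdomains, the two strategies coincide.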
The discussion: Part 2
• 1998, Ntafos. "Proportional partition testing has been suggested as a preferred way to perform partition testing because it assures performance that is at least as good as that of random testing. We showed that this goal for partition testing is rather questionable and even counterproductive. Partition testing strategies can be expected to perform better than random testing; the real issue is cost-effectiveness. Random testing may be a good complementary strategy to use, especially for final testing. It has the advantage of relatively easy reliability estimation from test outcomes."
• 1999, Gutjahr. Even if no especially error-prone subdomains of the input domain can be identified in advance, partition testing can provide substantially better results than random testing.
• 2003, Boland et al. For equal sample sizes from all the subdomains, partition testing is superior to random testing if the average of the subdomain failure rates is larger than the overall failure rate of the program.

Random vs. partition: summary
• Most comparisons are made on the basis of the methods' ability to detect at least one fault.
• Comparisons refer to "partition testing" in general.
• There is general agreement that "fault"-oriented partitions work better than other types of partitions, judged from the fault detection perspective.
• Fault detection ability is not the only thing that counts. Bear in mind the assumptions behind the math!
– The non-overlapping subdomains assumption used in most studies ignores the reality of common code across them.
– Most comparisons assume the same number of test cases in both instances.
– Most also assume the existence of an oracle.
• Consider the cost-effectiveness of a technique for your particular domain/testing problem. Do you have an automated oracle? What is your computational/human budget?

Randoop
• Feedback-directed random testing.
– Inputs: the location of an assembly; a time limit after which test generation stops; an optional set of configuration files specifying what should be tested, or what to avoid so as not to rediscover the same error.
– Generates unit tests (method sequences) and checks for assertion violations, access violations, and unexpected program termination.
– Before outputting an error-revealing sequence, Randoop attempts to minimize it by iteratively omitting method calls that can be removed from the sequence while preserving its error-revealing behavior.

AgitarOne
M. Boshernitsan, R. Doong, A. Savoia. From Daikon to Agitator: Lessons and Challenges in Building a Commercial Tool for Developer Testing, 2006

Fuzz testing
• Fuzz testing is a negative software testing method that feeds malformed and unexpected input data to a program, device, or system, with the purpose of finding security-related defects or any critical flaws leading to denial of service, degradation of service, or other undesired behavior. (A. Takanen et al., Fuzzing for Software Security Testing and Quality Assurance, 2008)
• Programs and frameworks that are used to create fuzz tests or perform fuzz testing are commonly called fuzzers.

Fuzzing process

Fuzzing approaches
• Generic: crude, random corruption of valid data without any regard to the data format. (See the sketch after this list.)
• Pattern-based: modify random data to conform to particular patterns, for example byte values alternating between a value in the ASCII range and zero, to "look like" Unicode.
• Intelligent: uses semi-valid data (which may pass a parser's or sanity checker's initial line of defense); requires understanding the underlying data format. For example, fuzzing the compression ratio for image formats, or PDF header or cross-reference table values.
• Large volume: fuzz tests at large scale. The Microsoft Security Development Lifecycle methodology recommends a minimum of 100,000 fuzzed data files.
• Exploit variant: vary a known exploitative input to take advantage of the same attack vector with a different input; good for evaluating the quality of a security patch.
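A minimal sketch of the "generic" approach, in the spirit of Miller et al.'s fuzz experiments: randomly corrupt bytes of a valid seed file and watch the target for crashes. The target command (./parser) and the seed file name are hypothetical placeholders.

    import random
    import subprocess

    def corrupt(data, rng, flips=16):
        # Generic fuzzing: random corruption with no regard to the format.
        buf = bytearray(data)
        for _ in range(flips):
            buf[rng.randrange(len(buf))] = rng.randrange(256)
        return bytes(buf)

    def fuzz_campaign(seed_path, trials=1000):
        rng = random.Random(0)
        seed = open(seed_path, "rb").read()
        for i in range(trials):
            with open("fuzzed.bin", "wb") as f:
                f.write(corrupt(seed, rng))
            proc = subprocess.run(["./parser", "fuzzed.bin"])
            if proc.returncode < 0:   # on POSIX: process killed by a signal
                print("trial", i, "crashed with signal", -proc.returncode)

    fuzz_campaign("valid_input.bin")

Note that the oracle here is the crude "crash/no crash" observer from the oracle-patterns figure, which is exactly why fuzzing scales so well.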
Some studies
• An Empirical Study of the Reliability of UNIX Utilities, B. Miller et al., 1990
– "We have been able to crash 25-33% of the utility programs on any version of UNIX that was tested."
• An Empirical Study of the Robustness of Windows NT Applications Using Random Testing, J. Forrester & B. Miller, 2000
– "When subjected to random valid input that could be produced by using the mouse and keyboard, we crashed 21% of applications tested (including Microsoft Office 97 and 2000, Adobe Acrobat Reader, Eudora, Netscape 4.7, Visual C++ 6.0, Internet Explorer (IE) 4.0 and 5.0), and hung an additional 24% of applications. When subjected to raw random Win32 messages, we crashed or hung all the applications that we tested."
• An Empirical Study of the Robustness of MacOS Applications Using Random Testing, B. Miller et al., 2006
– "Our testing crashed only 7% of the command-line utilities, a considerably lower rate of failure than observed in almost all cases of previous studies. We found the GUI-based applications to be less reliable: of the thirty that we tested, only eight did not crash or hang. Twenty others crashed, and two hung. These GUI results were noticeably worse than either of the previous Windows (Win32) or UNIX (X-Windows) studies."

Types of faults found
• Pointer/array errors
• Not checking return codes
• Invalid/out-of-boundary data
• Data corruption
• Signed characters
• Race conditions
• Undocumented features

Fuzzers
• AxMan: a web-based ActiveX fuzzing engine
• Blackops SMTP Fuzzing Tool: supports a variety of different SMTP commands and Transport Layer Security (TLS)
• BlueTooth Stack Smasher (BSS): an L2CAP-layer fuzzer, distributed under the GPL license
• COMRaider: a tool designed to fuzz COM object interfaces
• Dfuz: a generic fuzzer
• FileFuzz: a graphical, Windows-based file format fuzzing tool, designed to automate the creation of abnormal file formats and the execution of applications handling these files; it also has built-in debugging capabilities to detect exceptions resulting from the fuzzed file formats
• Fuzz: the original fuzzer, developed by Dr. Barton Miller at my alma mater, the University of Wisconsin-Madison, in 1990. Go Badgers!
• fuzzball2: a TCP/IP fuzzer
• radius fuzzer: a C-based RADIUS fuzzer written by Thomas Biege
• ip6sic: a protocol stressor for IPv6
• Mangle: a fuzzer for generating odd HTML tags; it will also auto-launch a browser
• PROTOS Project: software to fuzz Wireless Application Protocol (WAP), HTTP, Lightweight Directory Access Protocol (LDAP), Simple Network Management Protocol (SNMP), Session Initiation Protocol (SIP), and Internet Security Association and Key Management Protocol (ISAKMP)
• Scratch: a protocol fuzzer
• SMUDGE: a fault injector for many different types of protocols, written in Python
• SPIKE: a network protocol fuzzer
• SPIKEFile: another file format fuzzer, for attacking ELF (Linux) binaries, from iDefense; based on SPIKE
• SPIKE Proxy: a web application fuzzer
• Tag Brute Forcer: awesome fuzzer from Drew Copley at eEye for attacking all of those custom ActiveX applications; used to find a bunch of nasty IE bugs, including some really hard-to-reach heap overflows
• beSTORM: performs a comprehensive analysis, exposing security holes in your products during development and after release
• Hydra: takes network fuzzing and protocol testing to the next level by corrupting traffic intercepted "on the wire," transparently to both the client and server under test
M. Warnock, Look out! It's the fuzz!, 2007
VARIATIONS ON A (RANDOM) THEME…

Chaos Monkey / Simian Army
• A Netflix infrastructure testing system.
• "Malicious" programs randomly trample on components, the network, datacenters, AWS instances…
– Chaos Monkey was the first: it disables production instances at random.
– Other monkeys include Latency Monkey, Doctor Monkey, Conformity Monkey, etc. Fuzz testing at the infrastructure level.
– They force the failure of components to make sure that the system architecture is resilient to unplanned/random outages.
• Netflix has open-sourced its Chaos Monkey code.

Usability: A/B testing
• A controlled randomized experiment with two variants, A and B: the control and the treatment.
• One group of users is given A (the current system); another, randomly chosen group is presented with B; outcomes are compared.
• Often used in web or GUI-based applications, especially to test advertising or GUI element placement or design decisions.

Example
• A company sends an advertising email to its customer database, varying the photograph used in the ad…
– Group A (99% of users): "Act now! Sale ends soon!" [ad with photo variant 1]
– Group B (1%): "Act now! Sale ends soon!" [ad with photo variant 2]
• If more customers in the cat group than the dog group respond to the advertisement, this indicates a possibly fruitful marketing direction.

Integration: object protocols
• Covers the space of possible API calls, or program "conceptual states."
• Develop test cases that involve representative sequences of operations on objects.
– Example, dictionary structure: Create, AddEntry*, Lookup, ModifyEntry*, DeleteEntry, Lookup, Destroy
– Example, I/O stream: Open, Read, Read, Close, Read, Open, Write, Read, Close, Close
– Test concurrent access from multiple threads.
• Example: a FIFO queue for events, logging, etc.: Create, Put, Put, Put, Get, Get, Get, Get, Put, Put, Get
• Approach:
– Develop representative sequences based on use cases, scenarios, profiles.
– Randomly generate call sequences.
• Also useful for protocol interactions within distributed designs.

Stress testing
• A robustness testing technique: test beyond the limits of normal operation.
• Can be applied at any level of system granularity.
• Stress tests commonly put a greater emphasis on robustness, availability, and error handling under a heavy load than on what would be considered "correct" behavior under normal circumstances.

Soak testing
• Problem: a system may behave exactly as expected under artificially limited execution conditions.
– E.g., memory leaks may take a long time to lead to failure (this also motivates static/dynamic analysis, but we'll talk about that later).
• Soak testing: testing a system with a significant load over a significant period of time (a positive testing method).
• Used to check the reaction of a subject under test in a simulated environment, for a given duration and a given threshold.

Mutation testing
• A technique to evaluate the quality of a (typically black-box) test suite.
• Create many random mutants of a program and then run the test cases on them.
• If any of the test cases fail on a mutant, the test suite has "killed" that mutant.
• Test suite quality is measured by the number/percentage of killed mutants. (A toy sketch follows.)
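A toy sketch of the idea, with a single hand-applied mutation operator; real tools (see the following slides) generate and apply mutants automatically at the source or bytecode level.

    def original(a, b):
        return a + b

    def mutant(a, b):
        return a - b        # mutation operator applied: '+' replaced by '-'

    test_suite = [
        lambda f: f(0, 0) == 0,   # too weak: passes on both versions
        lambda f: f(2, 3) == 5,   # strong enough to kill the mutant
    ]

    def killed(fn, tests):
        # A mutant is "killed" if at least one test now fails on it.
        return any(not t(fn) for t in tests)

    assert all(t(original) for t in test_suite)   # suite passes the original
    print("mutant killed?", killed(mutant, test_suite))   # True
    # Mutation score = killed mutants / total mutants (here 1/1).

Had the suite contained only the first test, the mutant would have survived, signaling that the suite needs strengthening.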
[Figure: the mutation testing process. Solid boxes are automatic; dashed boxes are manual, though research continues on identifying equivalent mutants automatically and on automatically generating test cases that kill previously unkilled mutants.]
A practical system for mutation testing: help for the common programmer. J. Offutt, 1994

Example mutation operators
• Delete a statement.
• Change operators (example: && to ||).
• Replace expressions with true or false.
• Replace a variable with another variable.
• …There is considerable research on more/better/more powerful mutation types.

Example tools
• Java source code: Jester, muJava, Bacterio, Judy
• Java bytecode: Javalanche, Jumble, PIT
• Ruby: Mutant, Heckle
• .NET and C#: NinjaTurtles, Nester
• PHP: Mutagenesis

Duran & Ntafos, 1984. "Simulation results are presented which suggest that random testing may often be more cost effective than partition testing schemes"
[Figure: simulation results, 50 trials.]

Weyuker & Jeng, 1991. "We have shown analytically that partition testing can be an excellent testing strategy or a poor one"
• Partition testing can be better, worse, or the same as random testing, depending on how the partitioning is performed.

Weyuker & Jeng, 1991. "For a partition testing strategy to be really effective and therefore worth doing, it clearly has to perform substantially better than in these examples. It is therefore necessary that some subdomains be relatively small and contain only failure causing inputs, or at least nearly so"
• The partition P-measure is maximized when one or more partitions contain only inputs that produce incorrect outputs; in most such cases partition testing beats random testing.
• It is minimized when a dominant partition contains all the failure-causing inputs and is assigned just one test case; in most such cases random testing does better.

Proportional testing
• Chen & Yu, 1994. "We also find that partition testing is guaranteed to perform at least as well as random testing so long as the number of test cases selected is in proportion to the size of the subdomains" … "There is, however, a practical consideration that affects the applicability of Proposition 3. The exact ratios of the subdomain sizes may not be reducible to the ratios of small integers, thereby rendering the total number of test cases too large"
• Ntafos, 1998. "Proportional partition testing has been suggested as a preferred way to perform partition testing because it assures performance that is at least as good as that of random testing. We showed that this goal for partition testing is rather questionable and even counterproductive. Partition testing strategies can be expected to perform better than random testing; the real issue is cost-effectiveness. Random testing may be a good complementary strategy to use, especially for final testing. It has the advantage of relatively easy reliability estimation from test outcomes"
• Gutjahr, 1999.

1996, S. Reid. Empirical comparison of random testing, equivalence classes, and boundary value analysis:

    Technique                  Test cases required   Probability of detection   Faults detected
    Equivalence Classes        8                     .33                        6
    Random                     8                     .32                        5
    Boundary Value Analysis    25                    .73                        12
    Random                     25                    .37                        6

An Empirical Analysis of Equivalence Partitioning, Boundary Value Analysis and Random Testing, Stuart C. Reid, 1997
Chen & Yu, 1996. Overlapping subdomains and the use of the P-measure to evaluate test effectiveness
• "Another merit of the E-measure is that it can distinguish the capability of detecting more than one failure, while the P-measure regards a testing strategy as good as another so long as both can detect at least one failure"
• "For the overlapping case, we notice that the crucial factor of the relative performance of subdomain testing to random testing is the aggregate of the differences in the subdomain failure rate and the overall failure rate, weighted by the number of test cases selected from that subdomain. Thus, unlike the disjoint case, it is possible that all subdomain failure rates are higher than the overall, in which case subdomain testing is clearly better than random testing"

Gutjahr, 1999. The influence of uncertainty
• This paper compares partition testing and random testing on the assumption that program failure rates are not known with certainty and should, therefore, be modeled by random variables.
• It is shown that under uncertainty, partition testing compares more favorably to random testing than suggested by prior investigations concerning the deterministic case: the restriction to failure rates that are known with certainty systematically favors random testing.
• The case above is a boundary case (the worst case for partition testing), and the fault detection probability of partition testing can be up to k times higher than that of random testing, where k is the number of subdomains.
• Finally, briefly summarizing the consequences of these results for the work of a practicing test engineer:
– In spite of (erroneous) conclusions that might possibly be drawn from previous investigations, partition-based testing techniques are well founded. Even if no especially error-prone subdomains of the input domain can be identified in advance, partition testing can provide substantially better results than random testing.
– Because of the close relations between partition testing and other subdomain-based testing methods (branch testing, all-uses, mutation testing, etc.), the superiority of these last-mentioned methods over random testing can also be justified. The widespread practice of spending effort on satisfying diverse coverage criteria, instead of simply choosing random test cases, is not a superstitious custom; it is a procedure whose merits can be understood through sufficiently subtle, but formally precise, models.
– The effort of satisfying partition-based coverage criteria is particularly well spent whenever the partition leads to subdomains of largely varying sizes, each of which is processed by the program or system in a rather homogeneous way (i.e., the processing steps are similar for all inputs of a given subdomain). Conversely, the advantages of partition testing are only marginal in the case of subdomains of comparable sizes and heterogeneous treatment by the program. In any case, the partition should not be arbitrarily chosen, but carefully derived from the structure or function of the program.

Boland et al., 2003. Comparing partition and random testing via majorization and Schur functions
• We establish a general result that states that, for equal sample sizes from all the subdomains, partition testing is superior to random testing if the average of the subdomain failure rates is larger than the overall failure rate of the program.
• This general result helps in identifying many situations where partition testing will be more effective than random testing, giving strength to the partition testing approach.
• It generalizes the result established by Gutjahr, which states that, for samples of size one from each subdomain, partition testing is better than or the same as random testing if all subdomain failure rates have the same expected value.
• The most important results of the analysis are the following:
– For equal sample sizes from all the subdomains, partition testing outperforms random testing if the average of the subdomain failure rates is larger than the overall failure rate of the program. (Throughout the paper, the (sub)domain failure rate is defined as the ratio of the number of failure-causing inputs in the (sub)domain to the size of the (sub)domain.)
– For equal sample sizes from all the subdomains, partition testing is superior if the subdomain failure rates are inversely proportional to subdomain size.
– For unequal sample sizes, if […] and […], then partition testing is superior to random testing if the average of the subdomain failure rates is larger than the overall failure rate of the program.
– In cases when the numbers of failure-causing inputs in subdomains are assumed to be random variables (as in Gutjahr) and samples of size one are taken from each subdomain, partition testing is better than or the same as random testing if the average of the expected subdomain failure rates is larger than the expected failure rate of the program.